Abstract: As the size of deep neural networks (DNNs) continues to grow, so does their runtime latency. While model pruning and neural architecture search (NAS) can effectively reduce the computational workload, this reduction does not consistently translate into lower runtime latency. In this article, we identify that the root cause of the mismatch between workload reduction and latency reduction is the graphics processing unit (GPU) tail effect, a classic system issue caused by resource underutilization in the last processing wave of the GPU. We conduct a detailed DNN workload characterization, demonstrate the prevalence of the GPU tail effect across different DNN architectures, and reveal that the deep structure and lightweight per-layer workloads of DNNs exacerbate the tail effect during inference. We then propose a tail-aware design space enhancement and a DNN optimization algorithm that improve existing NAS and pruning designs, achieving better runtime latency and model accuracy. Extensive experiments show 11%–27% latency reduction over state-of-the-art (SOTA) DNN pruning and NAS methods.
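To make the tail effect concrete, the minimal sketch below estimates how much of a GPU sits idle in the final scheduling wave of a kernel. It is purely illustrative and not taken from the paper: the function name tail_utilization and the parameters num_blocks, num_sms, and blocks_per_sm are assumptions, and the example numbers (136 thread blocks, 108 SMs) are hypothetical.

```python
import math

def tail_utilization(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> float:
    """Estimate the fraction of GPU capacity used by the last (tail) wave.

    A kernel's thread blocks are scheduled in "waves" of at most
    num_sms * blocks_per_sm concurrent blocks; when the final wave is only
    partially filled, the remaining capacity sits idle (the tail effect).
    """
    wave_capacity = num_sms * blocks_per_sm            # blocks that can run concurrently
    num_waves = math.ceil(num_blocks / wave_capacity)  # total scheduling waves
    blocks_in_last_wave = num_blocks - (num_waves - 1) * wave_capacity
    return blocks_in_last_wave / wave_capacity

# Hypothetical example: a lightweight DNN layer launching 136 blocks on a GPU
# with 108 SMs fills the first wave but leaves the second (tail) wave ~74% idle.
print(tail_utilization(num_blocks=136, num_sms=108))   # ≈ 0.26
```

Under these assumptions, a layer whose block count barely spills into an extra wave pays nearly a full wave of latency for a small amount of work, which is why lightweight DNN layers are especially sensitive to the tail effect.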