Abstract: As deep learning models continue to grow larger and more complex, exploiting sparsity is becoming one of the most critical avenues for enhancing efficiency and scalability. Several methods for leveraging sparsity have been proposed to better balance the trade-off between compression ratio and accuracy. While these methods offer algorithmic advantages, they also introduce significant hardware overhead due to index-based encoding and decoding. In this paper, we propose CROSS, an end-to-end compilation optimization technique that achieves sparse DNN acceleration using GPU computation kernels. The key insight behind CROSS is to exploit parameter distribution locality and reconcile the "sparse" DNN computation with high-performance "dense" computation kernels. Specifically, we perform an in-depth analysis of sparse operations in mainstream DNN computing frameworks. We then decompose the sparse workload into multiple components to create highly efficient, specialized operators at different sparsity levels. Additionally, we introduce a novel sparse graph translation technique that enables the computation kernels to process the sparse workload. The resulting CROSS framework accommodates various sparsity patterns and optimization techniques, delivering an average $2.03\times$ inference-latency speedup over seven state-of-the-art solutions, with a smaller memory footprint, across various models and datasets.
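To make the key insight concrete, the following is a minimal, hypothetical PyTorch sketch of the general idea of mapping a pruned ("sparse") weight matrix onto the standard dense matmul kernel; the function name `block_sparse_matmul`, the 32-row tile size, and the all-zero-tile skipping rule are illustrative assumptions and not CROSS's actual operators or API.

```python
import torch

def block_sparse_matmul(weight: torch.Tensor, x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Illustrative sketch (not the CROSS implementation): tile the rows of a
    pruned weight matrix, skip tiles that contain no nonzeros, and push the
    surviving tiles through the ordinary dense matmul kernel."""
    out = torch.zeros(weight.shape[0], x.shape[1], dtype=x.dtype, device=x.device)
    for start in range(0, weight.shape[0], block):
        tile = weight[start:start + block]
        if not tile.any():                              # all-zero tile: nothing to compute
            continue
        out[start:start + tile.shape[0]] = tile @ x     # dense kernel does the actual work
    return out

# Example: a weight matrix with 90% of its rows pruned away.
w = torch.randn(512, 512)
w[torch.rand(512) < 0.9] = 0.0
x = torch.randn(512, 64)
assert torch.allclose(block_sparse_matmul(w, x), w @ x, atol=1e-4)
```

The point of the sketch is only that, when the nonzero parameters exhibit locality, the sparse computation can be regrouped into dense sub-problems and dispatched to highly tuned dense GPU kernels instead of index-based sparse kernels.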