Abstract: Sparse matrix-matrix multiplication (SpMM) is a fundamental operation widely used in deep neural networks (DNNs) and high-performance computing. Many compilation studies have optimized SpMM kernel code to achieve better performance. However, on the one hand, these efforts often focus solely on optimizing individual SpMM operations without fully considering the influence of preceding and subsequent operators on SpMM. On the other hand, when dense regions in SpMM require accumulation to the same output location, these dense matrix multiplications must be executed sequentially, leading to significant overhead from atomic additions or thread synchronization. In this article, we propose a novel compiler plug-in for efficient SpMM, named SpMMPlu-Pro. SpMMPlu-Pro inherits the sparse intermediate representation (Sparse IR) and sparse pattern representation [meta-operation (meta-op)], as well as five optimization passes, from SpMMPlu. To fully exploit the sparse properties, SpMMPlu-Pro implements a forward and backward cross-layer sparsity propagation algorithm, which propagates the sparsity of one layer to its preceding and succeeding layers, fully unlocking the potential of sparsity to accelerate neural network inference. To alleviate the inefficient accumulation of meta-ops caused by atomic addition or thread synchronization, we propose two complementary scheduling schemes: 1) a segmentation and grouping algorithm based on automatic search and 2) an atomic optimization method that restructures the meta-op data flow graph. We integrated SpMMPlu-Pro into MindSpore and evaluated its effectiveness and scalability on the NVIDIA V100 GPU and Huawei Ascend 910. The results show that SpMMPlu-Pro supports various sparsity patterns, achieving an average speedup of $4.10\times$ on the V100 GPU and $4.35\times$ on the Ascend 910 over the dense counterparts.