Optimizing Batched Small Matrix Multiplication on Multi-core DSP Architecture

Published: 2024 · Last Modified: 22 Jul 2025 · ISPA 2024 · CC BY-SA 4.0
Abstract: General Matrix Multiplication (GEMM) is a critical computational operation in scientific computing and machine learning. While traditional GEMM performs well on large matrices, it is inefficient in both data transfer and computation for small matrices. Many High-Performance Computing (HPC) tasks can be decomposed into large batches of small matrix multiplications, and multi-core Digital Signal Processors (DSPs) are commonly used to accelerate such workloads. We present a design for batched fusion small matrix multiplication (BFMM) tailored to multi-core DSP architectures. To address the inefficiency and redundancy in the storage and computational operations of batched small matrix multiplication, we design three strategies: a matrix fusion concatenation strategy, an access coordination mechanism, and a fragment aggregation mechanism. BFMM supports an efficient K-dimension multi-core parallelization strategy, and a parameter constraint model makes it highly portable. BFMM also includes a performance evaluation model that facilitates assessment and verification. Experimental results demonstrate that BFMM outperforms both traditional GEMM (TGEMM) on multi-core DSP and traditional GEMM with concatenated data access (TGEMM Op). For large batches of small matrices, our design achieves 1.21x to 18x higher performance than TGEMM Op on a single-core DSP, and 1.14x to 18.1x higher performance on multi-core DSP.