DySpMM: From Fix to Dynamic for Sparse Matrix-Matrix Multiplication Accelerators

Hongyi Wang, Kai Zhong, Haoyu Zhang, Shulin Zeng, Zhenhua Zhu, Xinhao Yang, Shuang Wang, Guohao Dai, Huazhong Yang, Yu Wang

Published: 01 Jan 2024, Last Modified: 05 Feb 2025, DAC 2024, CC BY-SA 4.0

Abstract: Sparse Matrix-Matrix Multiplication (SpMM) is a key operator in many fields and exhibits dynamic features in terms of sparsity, element distribution, and data dependency. Previous studies have proposed FPGA-based SpMM accelerators with fixed configurations of on-chip dataflow, leaving three major challenges unsolved: 1) Partitioning matrices with a fixed sub-matrix size to fit the limited on-chip buffer of the FPGA leads to performance loss, because the optimal sub-matrix size for minimizing memory access varies with dynamic sparsity. 2) The fixed row-wise allocation scheme for sparse elements in streaming architectures leads to unbalanced workloads, because the element distribution across sparse matrix rows is dynamic. 3) The read-after-write (RAW) hazard caused by the floating-point adder prevents the elements in one row from being processed consecutively, and architectures with a fixed execution order rely on time-consuming pre-processing to deal with dynamic data dependency. Motivated by the observation that fixed configurations lead to performance loss, we propose DySpMM, which introduces a dynamic design methodology to SpMM architectures. A configurable data distributor enables dynamic sub-matrix sizes, reducing the amount of memory access by up to 3.79×. An element-wise allocator is designed for dynamic workload balance, improving utilization by up to 3.74×. An interleaved reorder unit reorders the elements and dynamically avoids RAW hazards at runtime, removing the need for time-consuming pre-processing. We implement DySpMM on a U280 FPGA, and the evaluation shows that it achieves 1.42× geomean throughput compared with the state-of-the-art accelerator Sextans and 1.78× energy efficiency compared with a V100S GPU.
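The workload-balance point in challenge 2 can be illustrated with a small software model. The sketch below is not the DySpMM allocator; it is a toy comparison under stated assumptions: nonzeros are identified only by their row index, row-wise allocation sends row r to processing element (PE) r mod P, element-wise allocation hands out nonzeros round-robin, and all PEs wait for the busiest one. The function names and the utilization metric are hypothetical, and the sketch ignores how partial results from the same row would be combined across PEs.

```python
def row_wise_loads(row_ids, num_pes):
    """Count nonzeros per PE when each row is pinned to PE (row % num_pes)."""
    loads = [0] * num_pes
    for r in row_ids:
        loads[r % num_pes] += 1
    return loads


def element_wise_loads(row_ids, num_pes):
    """Count nonzeros per PE when nonzeros are dealt round-robin, ignoring rows."""
    loads = [0] * num_pes
    for i in range(len(row_ids)):
        loads[i % num_pes] += 1
    return loads


def utilization(loads):
    """Useful work divided by total PE-cycles, assuming all PEs wait for the busiest."""
    busiest = max(loads)
    if busiest == 0:
        return 1.0
    return sum(loads) / (busiest * len(loads))


# Toy sparse matrix (assumed for illustration): row 0 holds 9 nonzeros,
# rows 1, 2, and 3 hold one nonzero each.
row_ids = [0] * 9 + [1, 2, 3]
num_pes = 4

rw = row_wise_loads(row_ids, num_pes)
ew = element_wise_loads(row_ids, num_pes)

print("row-wise loads:    ", rw, "utilization:", round(utilization(rw), 3))
print("element-wise loads:", ew, "utilization:", round(utilization(ew), 3))
```

In this toy case the skewed row forces one PE to handle 9 of the 12 nonzeros under row-wise allocation, while round-robin allocation gives each PE 3. The size of the gap depends entirely on how skewed the row lengths are, which is why a fixed row-wise scheme behaves differently from matrix to matrix.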