Abstract: Numerous studies have proposed hardware architectures to accelerate sparse matrix multiplication, but these approaches often incur substantial area and power overhead, significantly compromising their efficiency in dense scenarios. On the other hand, systolic arrays deliver high efficiency for dense matrix operations, but their application to sparse matrices remains challenging. An ideal design should process both dense and sparse matrices with high efficiency to satisfy performance and versatility requirements. In this paper, we introduce DenSparSA, a balanced systolic-array-centric architecture that executes sparse matrix computations with minimal overhead over the original dense matrix computation. DenSparSA supports both single-side and dual-side unstructured sparse matrix multiplication with high efficiency. At the same time, the additional hardware required for managing sparsity is compact and decoupled from the conventional systolic array, allowing for minimal power overhead when switching back to dense matrix operation via circuit gating. The proposed design is implemented with the Nangate 45 nm library. Implementation results show that DenSparSA achieves a speedup ranging from $1.9\times$ to $22\times$ over the classic systolic array for sparse workloads, while maintaining relatively low area and power overhead. For dense workloads, the power overhead can be reduced to 12% for BF16 and 5% for FP32. Compared with existing solutions for sparse acceleration, DenSparSA delivers competitive ($0.82\times$-$1.32\times$) efficiency in sparse scenarios and $1.17\times$-$2.28\times$ better efficiency in dense scenarios, indicating a better balance between the two.
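For readers unfamiliar with the sparsity modes named in the abstract, the sketch below contrasts dense, single-side sparse, and dual-side sparse matrix multiplication in plain Python. It is an illustrative software reference only: the per-row (column, value) sparse format and the function names are assumptions made here for clarity and do not describe the DenSparSA hardware, its dataflow, or its storage format.

```python
# Illustrative software reference, not the DenSparSA hardware.
# Sparse operands are stored per row as (column, value) pairs, a simplified
# CSR-like layout chosen purely for illustration.

def dense_matmul(A, B):
    """Dense x dense: every multiply-accumulate is performed."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            for j in range(n):
                C[i][j] += A[i][p] * B[p][j]
    return C

def to_sparse_rows(A):
    """Keep only the non-zero entries of each row as (column, value) pairs."""
    return [[(j, v) for j, v in enumerate(row) if v != 0.0] for row in A]

def single_side_sparse_matmul(A_sparse, B):
    """Single-side sparsity: A is sparse, B is dense; zeros of A are skipped."""
    n = len(B[0])
    C = [[0.0] * n for _ in range(len(A_sparse))]
    for i, row in enumerate(A_sparse):
        for p, a in row:                  # only non-zeros of A contribute work
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C

def dual_side_sparse_matmul(A_sparse, B_sparse):
    """Dual-side sparsity: both operands sparse; work happens only where
    non-zeros of A meet non-zeros of B on the shared dimension."""
    n = max((j for row in B_sparse for j, _ in row), default=-1) + 1
    C = [[0.0] * n for _ in range(len(A_sparse))]
    for i, row in enumerate(A_sparse):
        for p, a in row:
            for j, b in B_sparse[p]:
                C[i][j] += a * b
    return C

# All three paths compute the same product; the sparse paths just skip zeros.
A = [[1.0, 0.0], [0.0, 2.0]]
B = [[0.0, 3.0], [4.0, 0.0]]
assert dense_matmul(A, B) == single_side_sparse_matmul(to_sparse_rows(A), B)
assert dense_matmul(A, B) == dual_side_sparse_matmul(to_sparse_rows(A), to_sparse_rows(B))
```

In software the savings come from skipping multiply-accumulates on zeros; the abstract's point is achieving the analogous skipping in hardware without burdening the dense-mode systolic array.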