Implementation and Optimization of Sparse BLAS on Kunpeng Processor

27 Feb 2025 (modified: 01 Mar 2025) | XJTU 2025 CSUC Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Sparse BLAS, Kunpeng 920, ARMv8, NEON optimization, Adaptive storage formats
TL;DR: This paper addresses the performance challenges of sparse Basic Linear Algebra Subprograms (BLAS) on ARM-based Kunpeng 920 processors through architectural adaptation and algorithmic innovation.
Abstract: This paper systematically addresses the performance challenges of sparse Basic Linear Algebra Subprograms (BLAS) on ARM-based Kunpeng 920 processors through architectural adaptation and algorithmic innovation. We develop the ACSR (Aligned Compressed Sparse Row) and AELL (Adaptive ELLPACK) storage formats, which eliminate zero-padding overhead while maintaining 128-bit memory alignment for NEON vectorization. Combined with NUMA-aware task scheduling and optimization guided by static code analysis, our implementation achieves 168.4 GFLOPS in sparse matrix-vector multiplication (SpMV), outperforming OpenBLAS by 37.8% and KML by 29.3% on real-world matrices. Microarchitecture analysis reveals a 92.7% L1 cache hit rate and 1.82 instructions per cycle (IPC), demonstrating effective utilization of the Kunpeng 920's 7 nm TaiShan V110 (TSV110) cores. This work provides critical insights for building high-performance sparse linear algebra ecosystems on domestic ARM processors.
Submission Number: 16
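
For readers unfamiliar with the SpMV kernel named in the abstract, the following is a minimal sketch of the operation on a plain CSR matrix. It is not the paper's ACSR or AELL format, whose layouts the abstract does not specify, and it uses no NEON intrinsics or NUMA scheduling. The names `csr_matrix`, `csr_spmv`, and `example_spmv` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Plain CSR (assumed baseline layout, not the paper's ACSR):
 * row_ptr has n_rows + 1 entries; col_idx and val hold the nonzeros
 * of each row contiguously. */
typedef struct {
    int n_rows;
    const int *row_ptr;
    const int *col_idx;
    const double *val;
} csr_matrix;

/* y = A * x: one sparse dot product per row. */
static void csr_spmv(const csr_matrix *a, const double *x, double *y) {
    assert(a != NULL && x != NULL && y != NULL);
    for (int i = 0; i < a->n_rows; ++i) {
        double sum = 0.0;
        for (int k = a->row_ptr[i]; k < a->row_ptr[i + 1]; ++k)
            sum += a->val[k] * x[a->col_idx[k]];
        y[i] = sum;
    }
}

/* Small illustrative case: A = [[4,0,1],[0,2,0],[3,0,5]], x = [1,1,2].
 * Writes the three entries of A * x into y. */
static void example_spmv(double y[3]) {
    static const int row_ptr[] = {0, 2, 3, 5};
    static const int col_idx[] = {0, 2, 1, 0, 2};
    static const double val[] = {4.0, 1.0, 2.0, 3.0, 5.0};
    static const double x[] = {1.0, 1.0, 2.0};
    const csr_matrix a = {3, row_ptr, col_idx, val};
    csr_spmv(&a, x, y);
}
```

The inner loop reads `val` and `col_idx` sequentially but gathers from `x` at irregular indices, which is why storage layout and alignment matter for vectorizing this kernel.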