Parallel Scan on Ascend AI Accelerators

Published: 23 Jun 2025, Last Modified: 23 Jun 2025, Greeks in AI 2025 Poster, CC BY 4.0
Keywords: AI accelerators, matrix multiplication accelerators, tensor cores, matrix engines, parallel computing, scan
TL;DR: Algorithmic design for AI operators using matrix multiplications, e.g., how can we sort using small matrix multiplications?
Abstract: In this presentation, I will explore hardware-aware algorithm design for modern high-performance AI accelerators. Specifically, I will take a fresh look at the well-established parallel primitive known as prefix sum, or scan, and examine the challenges and opportunities that arise when it is mapped onto matrix multiplication hardware. We design and implement parallel prefix sum (scan) algorithms on Huawei's Ascend AI accelerators. Ascend accelerators feature two kinds of specialized computing units: cube units for efficient matrix multiplication and vector units for optimized vector operations. A key feature of the proposed scan algorithms is their extensive use of the matrix multiplications and accumulations provided by the cube unit. To showcase the effectiveness of these algorithms, we implement and evaluate several scan-based operators commonly used in AI workloads, including sorting, tensor masking, and top-k / top-p (nucleus) sampling. We present a multi-core scan algorithm that fully utilizes both the cube and vector units of Ascend, reaching up to 37.5% of the theoretical memory bandwidth. In addition, our radix sort implementation, which uses matrix multiplications for its parallel splits, achieves a 3.3x speedup over the vector-only baseline, highlighting the potential of matrix engines to accelerate complex AI operations. This work will be presented as a main conference poster at the 39th IEEE International Parallel & Distributed Processing Symposium (IPDPS), June 3-7, 2025.
Related publications:
- Parallel Scan on Ascend AI Accelerators. Short version to appear in the 39th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2025) ( https://www.arxiv.org/abs/2505.15112 )
- A Parallel Scan Algorithm in the Tensor Core Unit Model. In International European Conference on Parallel and Distributed Computing (EuroPar 2023) ( https://doi.org/10.1007/978-3-031-39698-4_33 / https://arxiv.org/abs/2411.17887 )
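To make the idea of a scan built from matrix multiplications concrete, the following is a minimal sketch in plain Python, not the implementation evaluated in this work. It relies on a standard identity: if the input is laid out row by row in an s-by-s tile A, then A·U (with U the upper-triangular all-ones matrix) scans each row, and L·A·1 (with L the strictly lower-triangular all-ones matrix and 1 the all-ones matrix) gives each row the sum of all earlier rows. The function names, the tile width `s`, and the sequential carry between tiles are assumptions made for illustration only.

```python
def matmul(a, b):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_scan(values, s=4):
    """Inclusive prefix sum of `values` built from s-by-s matrix products.

    `s` is a hypothetical tile width chosen for illustration; real matrix
    engines fix their own tile sizes.
    """
    n = len(values)
    # upper[i][j] = 1 if i <= j: right-multiplying by it scans each row.
    upper = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]
    # strict_lower[i][j] = 1 if j < i: left-multiplying sums all earlier rows.
    strict_lower = [[1 if j < i else 0 for j in range(s)] for i in range(s)]
    ones = [[1] * s for _ in range(s)]

    out = []
    carry = 0  # running total of all previously processed tiles
    for start in range(0, n, s * s):
        chunk = values[start:start + s * s]
        chunk = chunk + [0] * (s * s - len(chunk))  # zero-pad the last tile
        tile = [chunk[r * s:(r + 1) * s] for r in range(s)]

        row_scans = matmul(tile, upper)
        # Entry [i][j] holds the sum of every element in rows 0..i-1.
        row_offsets = matmul(matmul(strict_lower, tile), ones)

        for r in range(s):
            for c in range(s):
                out.append(row_scans[r][c] + row_offsets[r][c] + carry)
        carry = out[-1]
    return out[:n]


print(matmul_scan([1, 2, 3, 4, 5]))
```

In this sketch the three matrix products are the part that would map naturally onto a matrix engine, while the element-wise additions and the carry between tiles are the kind of work a vector unit would handle; how the actual algorithms divide the work across cores is described in the papers listed above.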
Submission Number: 16