Parallel Scan on Ascend AI Accelerators

22 Jan 2026 (modified: 24 Jun 2026) | Submitted to ICML 2026 | CC BY-NC-ND 4.0
Abstract: Large Language Models (LLMs) depend on model- and architecture-specific optimizations to run efficiently at scale. Because the number of new LLM variants grows rapidly, it becomes necessary to distill and optimize only a few common parallel computational primitives, including prefix sum (scan). In this work, we design high-performance prefix-sum algorithms for the Ascend NPU architecture and explore their applicability in AI workloads and LLMs. The key feature of our algorithms is the efficient use of the vector and matrix units in the Ascend architecture, which allows us to reach up to 74.9% of the memory bandwidth achieved by memory copy. To showcase the effectiveness of matrix-multiplication-based scans as a fast primitive in AI workloads, we implemented several essential scan-based operators, such as radix sort and top-p, achieving speedups of up to 3.3× and 2.3×, respectively, over the vector-only kernels. Finally, we show how these optimized kernels can impact real-world models on the Ascend NPU, obtaining a 2.02× speedup on state-space neural networks like Mamba and up to 1.47× on LLMs with large vocabulary sizes, thanks to more efficient top-p token sampling.
Primary Area: Deep Learning->Algorithms
Keywords: Parallel scan, AI accelerators, Tensor cores, Radix-sort, Mamba
Submission Number: 15576
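
As a reader's aid, the general idea behind a matrix-multiplication-based scan mentioned in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the kernels from the submission: the function names (`upper_triangular_ones`, `matmul`, `matmul_scan`) and the tile size are hypothetical, and the triple-loop `matmul` merely stands in for a hardware matrix unit. The input is split into tiles, each tile is multiplied by an upper-triangular matrix of ones to obtain per-tile prefix sums, and a running offset then carries each tile's total into the next.

```python
from itertools import accumulate


def upper_triangular_ones(s):
    # U[i][j] = 1 when i <= j, so (row @ U)[j] equals sum(row[:j + 1]).
    return [[1 if i <= j else 0 for j in range(s)] for i in range(s)]


def matmul(a, b):
    # Plain matrix product; a stand-in for a hardware matrix unit (assumption).
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matmul_scan(values, tile=4):
    """Inclusive prefix sum computed tile by tile via a triangular matmul."""
    n = len(values)
    if n == 0:
        return []
    # Pad with zeros so the input splits evenly into rows of length `tile`.
    padded = list(values) + [0] * (-n % tile)
    rows = [padded[i:i + tile] for i in range(0, len(padded), tile)]
    # One matmul yields the inclusive scan inside every tile at once.
    local = matmul(rows, upper_triangular_ones(tile))
    # Carry each tile's total into the following tiles (vector-style step).
    out, offset = [], 0
    for row in local:
        out.extend(v + offset for v in row)
        offset += row[-1]
    return out[:n]


data = list(range(1, 11))
# The result agrees with the standard library's sequential scan.
assert matmul_scan(data) == list(accumulate(data))
```

The triangular product performs more arithmetic than a sequential scan, but it expresses the per-tile work as a single dense matrix multiplication, which is the kind of operation matrix units are built for.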