Abstract: Large Language Models (LLMs) depend on model- and architecture-specific optimizations to be efficiently
executed at a large scale. The number of new LLM variants grows rapidly, making it necessary to distill and
optimize only a few common parallel computational primitives, including prefix sum (scan). In this work, we
design high-performance prefix-sum algorithms for the Ascend NPU architecture and explore their applicability
in AI workloads and LLMs. The key feature of our algorithms is the efficient use of vector and matrix units in
the Ascend architecture, which allows us to reach up to 74.9% of the memory bandwidth achieved by memory
copy. To showcase the effectiveness of matrix-multiplication-based scans as a fast primitive in AI workloads, we
implement several essential scan-based operators, such as radix sort and top-p, achieving speedups of up to 3.3×
and 2.3×, respectively, over vector-only kernels. Finally, we show how these optimized kernels can impact
real-world models on the Ascend NPU, obtaining a 2.02× speedup on state-space neural networks like Mamba and
up to 1.47× on LLMs with large vocabulary sizes thanks to more efficient top-p token sampling.
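
For readers unfamiliar with the idea, the following is a minimal sketch of how a prefix sum can be phrased as a matrix multiplication. It is an illustration only, not the kernels described in the abstract: the function names (`matmul`, `upper_ones`, `matmul_scan`), the tile size `s`, and the sequential carry step are all assumptions made for this example. The input is cut into rows of length `s`; multiplying the resulting matrix by an upper-triangular matrix of ones yields the inclusive scan of every row at once, and a running carry then stitches the rows together.

```python
def matmul(a, b):
    """Naive dense matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def upper_ones(s):
    """s x s upper-triangular matrix of ones (diagonal included)."""
    return [[1 if i <= j else 0 for j in range(s)] for i in range(s)]


def matmul_scan(values, s=4):
    """Inclusive prefix sum of `values` computed tile by tile.

    Hypothetical helper for illustration; `s` is an arbitrary tile size.
    """
    n = len(values)
    # Pad with zeros so the input splits evenly into rows of length s.
    padded = list(values) + [0] * (-n % s)
    tiles = [padded[i:i + s] for i in range(0, len(padded), s)]
    # Entry (r, j) of tiles @ U is sum(tiles[r][0..j]): a per-row inclusive scan.
    local = matmul(tiles, upper_ones(s))
    # Add the total of all preceding rows to each row (sequential here for clarity).
    out, carry = [], 0
    for row in local:
        out.extend(v + carry for v in row)
        carry += row[-1]
    return out[:n]
```

In this sketch the matrix product is the part that would map onto a matrix unit, while the carry propagation is the kind of element-wise work that a vector unit could handle; how the actual kernels divide the work is not stated in the abstract.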
Primary Area: Deep Learning->Algorithms
Keywords: Parallel scan, AI accelerators, Tensor cores, Radix-sort, Mamba
Submission Number: 15576