Online Pseudo-average Shifting Attention (PASA) for Robust Low-precision LLM Inference: Algorithms, Numerical Analysis and Performance
Keywords: Low-Precision Inference, Attention, Long Sequence, Multi-modal, Overflow, NPU
TL;DR: PASA: Accelerating attention with low-precision computing for large models
Abstract: Attention computation remains a critical bottleneck for long-sequence inference in large models (e.g., long-text and video generation). To address this, we propose PASA (Pseudo-average Shifting Attention), a fully low-precision algorithm that remains mathematically equivalent to Flash Attention while enabling stable half-precision computation. PASA introduces two key innovations: (1) online pseudo-average shifting and (2) global error recovery, which jointly prevent overflow and preserve numerical accuracy by dynamically adjusting attention score statistics. This approach significantly relieves bandwidth pressure and makes use of the low-precision compute units on AI accelerators (e.g., NPUs). We identify that numerical instability in attention stems from a \textit{resonance} mechanism, namely phase alignment (or anti-alignment) between the query and key matrices along the head dimension, which triggers overflow of the attention score matrix and incurs large numerical errors. Experiments on Qwen2-7B and Stability AI/Stable-Video-Diffusion demonstrate that PASA eliminates these issues while achieving a $1.2$-$1.65\times$ speedup over a highly optimized vendor Flash Attention implementation from Ascend/CANN, running fully in half precision and reaching up to $170$ TFLOP/s on an Ascend NPU. To our knowledge, this is the first work to enable full half-precision acceleration for long-sequence attention (MHA/MQA) with guaranteed numerical stability.
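The overflow problem and the reason a shift can fix it without changing the result can be illustrated with a small NumPy sketch. This is not the PASA algorithm described in the abstract (which is online and includes global error recovery); it only demonstrates the underlying identity that softmax is unchanged when a per-row constant is subtracted from the scores. Subtracting the mean key vector from every key shifts each score row by exactly such a constant. The shapes, the `bias` value used to mimic aligned queries and keys, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax computed in float64 as a reference."""
    x = x.astype(np.float64)
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, head_dim = 8, 128
bias = 30.0  # hypothetical shared offset that aligns q and k along head_dim

q = rng.standard_normal((seq_len, head_dim)) + bias
k = rng.standard_normal((seq_len, head_dim)) + bias

# Raw scores are about head_dim * bias**2 = 115200, above the fp16 maximum (65504).
raw = q.astype(np.float32) @ k.astype(np.float32).T
with np.errstate(over="ignore"):
    raw_fp16 = raw.astype(np.float16)  # storing scores in half precision overflows

# Subtract the mean key: each score row changes by the constant q_i . mean(k).
k_shifted = k - k.mean(axis=0, keepdims=True)
shifted = q.astype(np.float32) @ k_shifted.astype(np.float32).T
shifted_fp16 = shifted.astype(np.float16)  # now well inside the fp16 range

# In exact arithmetic the attention weights are identical with or without the shift.
scale = 1.0 / np.sqrt(head_dim)
p_ref = softmax(q @ k.T * scale)
p_shift = softmax(q @ k_shifted.T * scale)

print("unshifted fp16 scores overflow:", bool(np.isinf(raw_fp16).all()))
print("shifted fp16 scores finite:", bool(np.isfinite(shifted_fp16).all()))
print("softmax unchanged by shift:", bool(np.allclose(p_ref, p_shift, atol=1e-6)))
```

The sketch only shows that the shifted scores fit in the half-precision range; the rounding error that remains after the shift is the kind of issue the abstract's error recovery step is meant to address, and it is not modeled here.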
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 23965