Online Pseudo-average Shifting Attention (PASA) for Robust Low-precision LLM Inference: Algorithms, Numerical Analysis and Performance

Long Cheng; Qichen Liao; Fan Wu; Junlin Mu; Tengfei Han; Zhe Qiu; Lianqiang Li; Zhen Zhang; Fangzheng Miao; Tianyi Liu; Keming Gao; LiangWangdym; Yinqiande

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Low-Precision Inference, Attention, Long Sequence, Multi-modal, Overflow, NPU

TL;DR: PASA: Accelerating attention with low-precision computing for large models

Abstract: Attention computation remains a critical bottleneck for long-sequence inference in large models (e.g., long-text and video generation). To address this, we propose PASA (Pseudo-average Shifting Attention), a fully low-precision algorithm that remains mathematically equivalent to Flash Attention while enabling stable half-precision computation. PASA introduces two key innovations: (1) online pseudo-average shifting and (2) global error recovery, which jointly prevent overflow and preserve numerical accuracy by dynamically adjusting attention score statistics. This approach significantly relieves bandwidth pressure and leverages the low-precision compute units on AI accelerators (e.g., NPUs). We identify that numerical instability in attention stems from a "resonance" mechanism, i.e., phase alignment (or anti-alignment) between the query and key matrices along the head dimension, which triggers overflow of the attention score matrix and incurs large numerical errors. Experiments on Qwen2-7B and Stability AI/Stable-Video-Diffusion demonstrate that PASA eliminates these issues while achieving a $1.2$-$1.65\times$ speedup over a highly optimized vendor Flash Attention implementation from Ascend/CANN, with fully half-precision computation reaching up to $170$ TFLOPs/s on an Ascend NPU. To our knowledge, this is the first work to enable fully half-precision acceleration of long-sequence attention (MHA/MQA) with guaranteed numerical stability.
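The overflow problem and the idea behind average shifting can be illustrated with a small sketch. It is not the PASA algorithm itself: the abstract does not specify the online pseudo-average or the global error recovery step. It only assumes the standard fact that softmax is invariant to a per-row constant, so subtracting the mean key from every key changes each score in a row by the same amount and leaves the attention weights unchanged. The helpers `to_fp16` and `dot_fp16`, the toy query/key values, and the omission of the softmax scale factor are all illustrative assumptions; half precision is emulated with the standard-library `struct` format `e`.

```python
import math
import struct

def to_fp16(x):
    """Round a float to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # magnitude beyond the fp16 range (max finite 65504)
        return math.copysign(math.inf, x)

def dot_fp16(a, b):
    """Dot product with every product and partial sum rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc

def softmax(scores):
    """Numerically safe softmax in double precision."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (assumed): every key shares a large common component (300),
# which is aligned with the query along the head dimension.
q = [30.0, 31.0, 29.0, 30.0, 32.0, 28.0, 30.0, 30.0]
deltas = [
    [0.25, -0.25, 0.0, 0.0, 0.25, -0.25, 0.0, 0.0],
    [-0.25, 0.25, 0.0, 0.0, -0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.25, -0.25, 0.0, 0.0, 0.25, -0.25],
    [0.0, 0.0, -0.25, 0.25, 0.0, 0.0, -0.25, 0.25],
]
keys = [[300.0 + d for d in row] for row in deltas]

# 1) Naive fp16 scores: q.k is about 72000, beyond the fp16 range.
raw_fp16 = [dot_fp16(q, k) for k in keys]

# 2) Shift every key by the mean key (mean taken in double precision here
#    for simplicity), then compute the scores in fp16.
dim = len(q)
mean_k = [sum(k[i] for k in keys) / len(keys) for i in range(dim)]
shifted_keys = [[to_fp16(k[i] - mean_k[i]) for i in range(dim)] for k in keys]
shifted_fp16 = [dot_fp16(q, k) for k in shifted_keys]

# 3) Double-precision reference on the unshifted keys.
ref_scores = [sum(x * y for x, y in zip(q, k)) for k in keys]
p_ref = softmax(ref_scores)
p_shift = softmax(shifted_fp16)

print("raw fp16 scores:    ", raw_fp16)
print("shifted fp16 scores:", shifted_fp16)
print("max |p_ref - p_shift|:", max(abs(a - b) for a, b in zip(p_ref, p_shift)))
```

In this toy case the unshifted half-precision scores all overflow to infinity, while the shifted scores stay small and give the same attention weights as the double-precision reference. A practical kernel would additionally have to compute the shift blockwise and in low precision, which is where the online scheme and error recovery described in the abstract would come in.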

Supplementary Material: zip

Primary Area: infrastructure, software libraries, hardware, systems, etc.

Submission Number: 23965
