SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling

ICLR 2026 Conference Submission 12505 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM inference, Quantization, Sparse attention, Prefilling
Abstract: Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform only a coarse-grained inspection of the attention map, resulting in suboptimal performance. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLMs with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention map estimation by using low-bit quantized query-key products to approximate attention weights, followed by a novel Relative Attention Score metric to assess the importance of query-key pairs. This design enables us to accurately identify important regions in the attention map and thereby construct a highly sparse attention mask. We implement a custom CUDA kernel in SALE optimized for hardware efficiency, reducing estimation overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36× speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.
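For intuition only, below is a minimal PyTorch sketch of the kind of pipeline the abstract describes: quantize queries and keys to low bit-width, estimate attention weights from the quantized products, score coarse blocks of the map, and keep only blocks whose relative score exceeds a threshold. The quantization scheme, block size, the specific relative-score definition (tile maximum normalized by the per-row maximum), and the threshold `tau` are illustrative assumptions, not the paper's actual algorithm or kernel.

```python
# Hypothetical sketch of low-bit attention-map estimation and block-sparse mask
# construction. Names and hyperparameters are illustrative assumptions.
import torch


def quantize_sym(x, bits=4):
    # Symmetric per-token quantization to `bits` bits (simulated in floating point).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.round(x / scale), scale


def estimate_block_mask(q, k, block=64, bits=4, tau=0.02):
    # q, k: [seq_len, head_dim] for a single attention head (causal attention).
    n, d = q.shape
    q_q, q_s = quantize_sym(q, bits)
    k_q, k_s = quantize_sym(k, bits)
    # Low-bit estimate of attention logits; the integer matmul is simulated in fp32.
    approx = (q_q @ k_q.T) * (q_s * k_s.T) / d ** 0.5
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    weights = torch.softmax(approx.masked_fill(causal, float("-inf")), dim=-1)
    # Pool estimated weights into (block x block) tiles and normalize per query row,
    # giving a crude "relative importance" score for each tile.
    nb = n // block
    tiles = weights[: nb * block, : nb * block].reshape(nb, block, nb, block).amax(dim=(1, 3))
    rel_score = tiles / tiles.amax(dim=-1, keepdim=True).clamp(min=1e-8)
    # Tiles above the threshold would be computed exactly; the rest are skipped.
    return rel_score > tau


if __name__ == "__main__":
    torch.manual_seed(0)
    q, k = torch.randn(512, 128), torch.randn(512, 128)
    mask = estimate_block_mask(q, k)
    print("fraction of blocks kept:", mask.float().mean().item())
```

In a real system, the low-bit estimate and thresholding would run inside a fused CUDA kernel before the exact attention pass, so only the blocks flagged by the mask incur full-precision computation.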
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 12505