Post-Training Sparse Attention with Double Sparsity

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: large language models, sparse attention, decoding
Abstract: Long-context inference of Large Language Models (LLMs) is known to be challenging due to excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which computes self-attention using only the important tokens, with channel sparsity, which uses important feature channels to identify those important tokens. Our key insight is that the pattern of channel sparsity is highly static, allowing us to determine it via offline calibration and make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve a significant reduction in memory usage. Experimental results demonstrate that Double Sparsity can achieve $\frac{1}{16}$ sparsity with minimal impact on accuracy across various tasks and architectures, including MHA, GQA, MoE, and vision-language models. It delivers up to a $14.1\times$ acceleration in attention operations and a $1.9\times$ improvement in end-to-end inference on GPUs across various batch sizes. With CPU offloading under extremely long-context settings (e.g., 256K), it achieves a $16.3\times$ decoding speedup compared to state-of-the-art solutions. Our code is integrated into the widely used framework SGLang and deployed in real-world workloads.
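
To illustrate the mechanism the abstract describes, below is a minimal sketch of a single Double Sparsity decoding step, not the authors' implementation: `channel_idx` (the calibrated important channels) and `top_k` are assumed parameters, and the exact approximation used in the paper may differ.

```python
# Hypothetical sketch of one Double Sparsity decoding step (assumptions:
# channel_idx comes from offline calibration; top_k is the token budget).
import torch

def double_sparsity_attention(q, K, V, channel_idx, top_k):
    """q: (d,) query; K, V: (n, d) cached keys/values.
    Returns the attention output computed over only top_k tokens."""
    d = q.shape[-1]
    # Channel sparsity: approximate attention scores using only the
    # calibrated important channels, so most of the KV cache is not read.
    approx_scores = K[:, channel_idx] @ q[channel_idx]        # (n,)
    # Token sparsity: keep the tokens with the highest approximate scores.
    idx = approx_scores.topk(min(top_k, K.shape[0])).indices  # (top_k,)
    # Exact attention restricted to the selected tokens.
    scores = (K[idx] @ q) / d**0.5                            # (top_k,)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[idx]                                   # (d,)
```

Because only `len(channel_idx)` channels are read to rank tokens and only `top_k` tokens are read for the exact attention, KV cache traffic shrinks roughly in proportion to the chosen sparsity (e.g., $\frac{1}{16}$), which is where the reported speedups come from.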
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4921