Post-Training Sparse Attention with Double Sparsity

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: large language models, sparse attention, decoding
Abstract: Long-context inference of Large Language Models (LLMs) is known to be challenging due to excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which computes self-attention using only the important tokens, with channel sparsity, which uses important feature channels to identify those important tokens. Our key insight is that the pattern of channel sparsity is highly static, allowing us to determine it via offline calibration and make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve a significant reduction in memory usage. Experimental results demonstrate that Double Sparsity can achieve $\frac{1}{16}$ sparsity with minimal impact on accuracy across various tasks and architectures, including MHA, GQA, MoE, and vision-language models. It delivers up to a $14.1\times$ acceleration in attention operations and a $1.9\times$ improvement in end-to-end inference on GPUs across various batch sizes. With CPU offloading under extremely long-context settings (e.g., 256K), it achieves a $16.3\times$ decoding speedup compared to state-of-the-art solutions. Our code is integrated into the widely used framework SGLang and deployed in real-world workloads.
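
To illustrate the mechanism the abstract describes, below is a minimal sketch of a single Double Sparsity decoding step, not the authors' implementation: `channel_idx` (the calibrated important channels) and `top_k` are assumed parameters, and the exact approximation used in the paper may differ.

```python
# Hypothetical sketch of one Double Sparsity decoding step (assumptions:
# channel_idx comes from offline calibration; top_k is the token budget).
import torch

def double_sparsity_attention(q, K, V, channel_idx, top_k):
    """q: (d,) query; K, V: (n, d) cached keys/values.
    Returns the attention output computed over only top_k tokens."""
    d = q.shape[-1]
    # Channel sparsity: approximate attention scores using only the
    # calibrated important channels, so most of the KV cache is not read.
    approx_scores = K[:, channel_idx] @ q[channel_idx]        # (n,)
    # Token sparsity: keep the tokens with the highest approximate scores.
    idx = approx_scores.topk(min(top_k, K.shape[0])).indices  # (top_k,)
    # Exact attention restricted to the selected tokens.
    scores = (K[idx] @ q) / d**0.5                            # (top_k,)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[idx]                                   # (d,)
```

Because only `len(channel_idx)` channels are read to rank tokens and only `top_k` tokens are read for the exact attention, KV cache traffic shrinks roughly in proportion to the chosen sparsity (e.g., $\frac{1}{16}$), which is where the reported speedups come from.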
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4921