Keywords: LLM Inference, Sparse Attention
Abstract: Long-context LLMs increasingly rely on long prefill prompts for agents and domain Q&A, making attention computation and KV-cache access the dominant decode-time bottlenecks. While sparse attention methods reduce computation and transfer costs, they struggle to simultaneously maintain model accuracy and achieve high inference speed under high sparsity. To address this challenge, we propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention method for efficient LLM inference. CSAttention adopts a storage-for-computation strategy: during the offline prefill stage, it leverages the query distribution to construct a fixed-size, query-centric lookup table in each subspace, so that online decoding can perform efficient lookups and centroid-score accumulation over regular, GPU-friendly data structures. By combining subspace partitioning with query-centric table construction, CSAttention mitigates the distribution shift between queries and keys and reliably recovers high-scoring keys even under very high sparsity, enabling significant computational savings while maintaining competitive model performance. Extensive experiments demonstrate that CSAttention maintains near-lossless model accuracy while delivering substantial improvements in inference efficiency. Compared to state-of-the-art sparse attention methods, CSAttention achieves superior model accuracy and higher inference speed in high-sparsity (95%) and long-context (32K-128K) scenarios. Notably, CSAttention achieves up to a 4.24× speedup over full attention when decoding at a 128K context length, demonstrating its practical value for scalable long-context inference.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15846
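To make the offline/online split described in the abstract more concrete, here is a minimal sketch of one plausible reading of the pipeline: per subspace, cluster the prefill queries into centroids and precompute centroid-to-key partial scores as a fixed-size table; at decode time, map the new query to its nearest centroid in each subspace and accumulate the precomputed partial scores to pick likely high-scoring keys. The function names, the use of a toy k-means, and NumPy itself are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_lookup_table(prefill_queries, keys, n_sub, n_centroids, seed=0):
    """Offline (prefill) stage sketch: per subspace, cluster observed queries into
    centroids and precompute centroid-to-key partial scores (the lookup table).
    Assumed shapes: prefill_queries (n_q, d), keys (n_k, d)."""
    d = prefill_queries.shape[1]
    sub = d // n_sub
    rng = np.random.default_rng(seed)
    centroids, tables = [], []
    for s in range(n_sub):
        q_s = prefill_queries[:, s * sub:(s + 1) * sub]   # queries restricted to subspace s
        k_s = keys[:, s * sub:(s + 1) * sub]              # keys restricted to subspace s
        # Toy k-means on the query slice (a stand-in for the paper's clustering choice).
        c = q_s[rng.choice(len(q_s), n_centroids, replace=False)].copy()
        for _ in range(10):
            assign = np.argmin(((q_s[:, None] - c[None]) ** 2).sum(-1), axis=1)
            for j in range(n_centroids):
                if (assign == j).any():
                    c[j] = q_s[assign == j].mean(axis=0)
        centroids.append(c)
        tables.append(c @ k_s.T)                          # (n_centroids, n_keys) partial scores
    return centroids, tables

def select_keys(query, centroids, tables, n_sub, top_k):
    """Online (decode) stage sketch: assign the new query to its nearest centroid
    in each subspace, accumulate precomputed partial scores, keep the top-k keys."""
    d = query.shape[0]
    sub = d // n_sub
    scores = np.zeros(tables[0].shape[1])
    for s in range(n_sub):
        q_s = query[s * sub:(s + 1) * sub]
        nearest = np.argmin(((centroids[s] - q_s) ** 2).sum(-1))  # nearest query centroid
        scores += tables[s][nearest]                               # centroid-score accumulation
    return np.argsort(scores)[-top_k:]                             # indices of likely high-scoring keys
```

Under this reading, the decode-time cost per query is a small centroid search plus table lookups per subspace rather than a full query-key dot product over the entire KV-cache, which is how the fixed-size, query-centric tables would trade extra storage for reduced computation.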