Consensus Sparse Attention: A Memory- and Computation-Efficient Mechanism Based on Inter-Head Consensus
Abstract: Large language models have achieved state-of-the-art performance across a wide range of NLP tasks, but their deployment is constrained by the attention mechanism’s quadratic scaling with sequence length, leading to extensive memory requirements.
We propose Consensus Sparse Attention (CSA), a lightweight optimization that reduces the cost of attention computation while maintaining model accuracy and requiring no additional post-training.
CSA uses a small set of representative attention heads to identify a consensus set of salient tokens, which is then shared across all remaining heads. This mechanism significantly reduces both computational cost and memory consumption while preserving the model's contextual understanding (see the illustrative sketch following the abstract).
CSA integrates seamlessly into existing attention architectures and requires no further adaptation. Experimentally, CSA delivers a 2× inference speedup and 50% lower peak memory usage while maintaining 99.7% accuracy on LLaMA-3, 99.8% on Qwen2, and 99% on the Needle In A Haystack benchmark for long-context understanding.
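To make the mechanism concrete, below is a minimal, non-causal PyTorch sketch of the idea as described in the abstract, not the authors' implementation: a few representative heads attend densely, their attention mass is aggregated to pick a consensus set of salient key positions, and the remaining heads attend only to that subset. The function name `consensus_sparse_attention` and the parameters `num_rep_heads` and `top_k` are illustrative assumptions.

```python
# Illustrative sketch only (assumed names and hyperparameters, no causal mask).
import torch
import torch.nn.functional as F


def consensus_sparse_attention(q, k, v, num_rep_heads=2, top_k=64):
    """q, k, v: [batch, heads, seq_len, head_dim]."""
    b, h, n, d = q.shape
    scale = d ** -0.5
    top_k = min(top_k, n)

    # 1) Dense attention for the representative heads.
    rep_scores = torch.matmul(
        q[:, :num_rep_heads], k[:, :num_rep_heads].transpose(-2, -1)
    ) * scale
    rep_probs = rep_scores.softmax(dim=-1)
    rep_out = torch.matmul(rep_probs, v[:, :num_rep_heads])

    # 2) Consensus: aggregate the representative heads' attention mass per key
    #    position and keep the top-k most salient positions.
    saliency = rep_probs.sum(dim=(1, 2))                    # [b, n]
    idx = saliency.topk(top_k, dim=-1).indices              # [b, top_k]

    # 3) Remaining heads attend only to the consensus key/value subset.
    gather_idx = idx[:, None, :, None].expand(b, h - num_rep_heads, top_k, d)
    k_sel = k[:, num_rep_heads:].gather(2, gather_idx)      # [b, h-r, top_k, d]
    v_sel = v[:, num_rep_heads:].gather(2, gather_idx)
    rest_scores = torch.matmul(q[:, num_rep_heads:], k_sel.transpose(-2, -1)) * scale
    rest_out = torch.matmul(rest_scores.softmax(dim=-1), v_sel)

    return torch.cat([rep_out, rest_out], dim=1)            # [b, h, n, d]


if __name__ == "__main__":
    q = torch.randn(1, 8, 128, 64)
    k = torch.randn(1, 8, 128, 64)
    v = torch.randn(1, 8, 128, 64)
    print(consensus_sparse_attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])
```

In this sketch the non-representative heads score only `top_k` keys instead of all `n`, which is the source of the computation and memory savings the abstract claims; how the representative heads are chosen and how the consensus set is formed are design choices of the paper that the sketch does not attempt to reproduce.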
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Consensus Sparse Attention
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: English
Submission Number: 5860