Consensus Sparse Attention: A Memory- and Computation-Efficient Mechanism Based on Inter-Head Consensus
Abstract: Large language models have achieved state-of-the-art performance across a wide range of NLP tasks, but their deployment is constrained by the attention mechanism’s quadratic scaling with sequence length, leading to extensive memory requirements.
We propose Consensus Sparse Attention (CSA), a lightweight optimization that reduces the cost of attention computation while maintaining model accuracy and requiring no additional post-training.
CSA uses a small set of representative attention heads to identify a consensus set of salient tokens, which is then shared across all remaining heads. This mechanism significantly reduces both computational cost and memory consumption while preserving the model's contextual understanding (see the illustrative sketch following the abstract).
CSA integrates seamlessly into existing attention architectures and requires no further adaptation. Experimentally, CSA delivers a 2× inference speedup and 50% lower peak memory usage while maintaining 99.7% accuracy on LLaMA-3, 99.8% on Qwen2, and 99% on the Needle In A Haystack benchmark for long-context understanding.
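To make the mechanism concrete, below is a minimal, non-causal PyTorch sketch of the idea as described in the abstract, not the authors' implementation: a few representative heads attend densely, their attention mass is aggregated to pick a consensus set of salient key positions, and the remaining heads attend only to that subset. The function name `consensus_sparse_attention` and the parameters `num_rep_heads` and `top_k` are illustrative assumptions.

```python
# Illustrative sketch only (assumed names and hyperparameters, no causal mask).
import torch
import torch.nn.functional as F


def consensus_sparse_attention(q, k, v, num_rep_heads=2, top_k=64):
    """q, k, v: [batch, heads, seq_len, head_dim]."""
    b, h, n, d = q.shape
    scale = d ** -0.5
    top_k = min(top_k, n)

    # 1) Dense attention for the representative heads.
    rep_scores = torch.matmul(
        q[:, :num_rep_heads], k[:, :num_rep_heads].transpose(-2, -1)
    ) * scale
    rep_probs = rep_scores.softmax(dim=-1)
    rep_out = torch.matmul(rep_probs, v[:, :num_rep_heads])

    # 2) Consensus: aggregate the representative heads' attention mass per key
    #    position and keep the top-k most salient positions.
    saliency = rep_probs.sum(dim=(1, 2))                    # [b, n]
    idx = saliency.topk(top_k, dim=-1).indices              # [b, top_k]

    # 3) Remaining heads attend only to the consensus key/value subset.
    gather_idx = idx[:, None, :, None].expand(b, h - num_rep_heads, top_k, d)
    k_sel = k[:, num_rep_heads:].gather(2, gather_idx)      # [b, h-r, top_k, d]
    v_sel = v[:, num_rep_heads:].gather(2, gather_idx)
    rest_scores = torch.matmul(q[:, num_rep_heads:], k_sel.transpose(-2, -1)) * scale
    rest_out = torch.matmul(rest_scores.softmax(dim=-1), v_sel)

    return torch.cat([rep_out, rest_out], dim=1)            # [b, h, n, d]


if __name__ == "__main__":
    q = torch.randn(1, 8, 128, 64)
    k = torch.randn(1, 8, 128, 64)
    v = torch.randn(1, 8, 128, 64)
    print(consensus_sparse_attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])
```

In this sketch the non-representative heads score only `top_k` keys instead of all `n`, which is the source of the computation and memory savings the abstract claims; how the representative heads are chosen and how the consensus set is formed are design choices of the paper that the sketch does not attempt to reproduce.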
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Consensus Sparse Attention
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: English
Submission Number: 5860