Consensus Sparse Attention: A Memory and Computation Efficient Mechanism Based on Inter-Head Consensus
Abstract: The inference efficiency of large language models (LLMs) is limited by the computational complexity and memory usage of attention layers. To address these challenges, we introduce Consensus Sparse Attention (CSA), a technique that leverages the consensus of a few representative attention heads to select the key tokens for the remaining heads. This restricts the attention computation from all tokens to a small set of candidate tokens, reducing both computation and peak memory consumption without additional training.
Experiments on models of diverse scales and varied downstream tasks demonstrate that CSA offers a significant improvement in computational efficiency with a negligible decrease in accuracy. In particular, on LLaMA-3, CSA achieves a two-fold speedup and halves the peak memory usage of the attention-layer computation during the prefilling stage.
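The abstract does not specify the exact selection procedure, so the following is only a minimal illustrative sketch of the idea it describes: a few representative heads attend densely, their attention mass is pooled to pick a shared candidate key set per query, and the remaining heads attend only over those candidates. All function names, shapes, and the mean-based consensus rule are assumptions, and causal masking is omitted for brevity.

```python
import torch


def consensus_sparse_attention(q, k, v, rep_heads, top_k):
    """Hypothetical sketch. q, k, v: (batch, num_heads, seq_len, head_dim)."""
    b, h, n, d = q.shape
    scale = d ** -0.5

    # 1) Representative heads attend densely over all key tokens.
    rep_scores = torch.einsum("bhqd,bhkd->bhqk", q[:, rep_heads], k[:, rep_heads]) * scale
    rep_attn = rep_scores.softmax(dim=-1)
    rep_out = torch.einsum("bhqk,bhkd->bhqd", rep_attn, v[:, rep_heads])

    # 2) Consensus (assumed here to be a simple mean over representative heads):
    #    keep the top_k candidate key positions for each query.
    consensus = rep_attn.mean(dim=1)                      # (b, n, n)
    cand_idx = consensus.topk(top_k, dim=-1).indices      # (b, n, top_k)

    # 3) Remaining heads attend only over the shared candidate keys.
    rest = [i for i in range(h) if i not in rep_heads]
    q_rest, k_rest, v_rest = q[:, rest], k[:, rest], v[:, rest]
    idx = cand_idx.unsqueeze(1).unsqueeze(-1).expand(b, len(rest), n, top_k, d)
    k_sel = k_rest.unsqueeze(2).expand(b, len(rest), n, n, d).gather(3, idx)
    v_sel = v_rest.unsqueeze(2).expand(b, len(rest), n, n, d).gather(3, idx)

    rest_scores = torch.einsum("bhqd,bhqkd->bhqk", q_rest, k_sel) * scale
    rest_out = torch.einsum("bhqk,bhqkd->bhqd", rest_scores.softmax(dim=-1), v_sel)

    # Reassemble all heads in their original order.
    out = torch.empty(b, h, n, d, dtype=q.dtype, device=q.device)
    out[:, rep_heads], out[:, rest] = rep_out, rest_out
    return out
```

In this sketch the non-representative heads score only top_k keys per query instead of all n, which is where the claimed compute and peak-memory savings would come from; the paper's actual consensus rule and candidate-selection details may differ.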
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Consensus Sparse Attention
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8032