Consensus Sparse Attention: A Memory and Computation Efficient Mechanism Based on Inter-Head Consensus
Abstract: The inference efficiency of large language models (LLMs) is limited by the computational complexity and memory usage of attention layers. To address these challenges, we introduce Consensus Sparse Attention (CSA), a technique that leverages the consensus of a few representative attention heads to select the key tokens for the remaining heads. This restricts the attention computation from all tokens to a small set of candidate tokens, reducing both computation and peak memory consumption without additional training.
Experiments on models of diverse scales and varied downstream tasks demonstrate that CSA offers a significant improvement in computational efficiency with a negligible decrease in accuracy. In particular, on LLaMA-3, CSA achieves a two-fold speedup and halves the peak memory usage of the attention-layer computation during the prefilling stage.
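The abstract does not specify the exact selection procedure, so the following is only a minimal illustrative sketch of the idea it describes: a few representative heads attend densely, their attention mass is pooled to pick a shared candidate key set per query, and the remaining heads attend only over those candidates. All function names, shapes, and the mean-based consensus rule are assumptions, and causal masking is omitted for brevity.

```python
import torch


def consensus_sparse_attention(q, k, v, rep_heads, top_k):
    """Hypothetical sketch. q, k, v: (batch, num_heads, seq_len, head_dim)."""
    b, h, n, d = q.shape
    scale = d ** -0.5

    # 1) Representative heads attend densely over all key tokens.
    rep_scores = torch.einsum("bhqd,bhkd->bhqk", q[:, rep_heads], k[:, rep_heads]) * scale
    rep_attn = rep_scores.softmax(dim=-1)
    rep_out = torch.einsum("bhqk,bhkd->bhqd", rep_attn, v[:, rep_heads])

    # 2) Consensus (assumed here to be a simple mean over representative heads):
    #    keep the top_k candidate key positions for each query.
    consensus = rep_attn.mean(dim=1)                      # (b, n, n)
    cand_idx = consensus.topk(top_k, dim=-1).indices      # (b, n, top_k)

    # 3) Remaining heads attend only over the shared candidate keys.
    rest = [i for i in range(h) if i not in rep_heads]
    q_rest, k_rest, v_rest = q[:, rest], k[:, rest], v[:, rest]
    idx = cand_idx.unsqueeze(1).unsqueeze(-1).expand(b, len(rest), n, top_k, d)
    k_sel = k_rest.unsqueeze(2).expand(b, len(rest), n, n, d).gather(3, idx)
    v_sel = v_rest.unsqueeze(2).expand(b, len(rest), n, n, d).gather(3, idx)

    rest_scores = torch.einsum("bhqd,bhqkd->bhqk", q_rest, k_sel) * scale
    rest_out = torch.einsum("bhqk,bhqkd->bhqd", rest_scores.softmax(dim=-1), v_sel)

    # Reassemble all heads in their original order.
    out = torch.empty(b, h, n, d, dtype=q.dtype, device=q.device)
    out[:, rep_heads], out[:, rest] = rep_out, rest_out
    return out
```

In this sketch the non-representative heads score only top_k keys per query instead of all n, which is where the claimed compute and peak-memory savings would come from; the paper's actual consensus rule and candidate-selection details may differ.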
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Consensus Sparse Attention
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8032