Keywords: KV Cache Compression; Large Language Models; Efficient Inference
TL;DR: Compress the KV Cache of LLMs by estimating attention scores from future queries.
Abstract: Large language models encounter a significant memory bottleneck during inference due to the Key-Value (KV) cache, which stores past token representations and grows linearly with context length. Although using attention scores to evict KV pairs is promising, it is often impractical in real-world scenarios because the attention scores from future tokens have not yet been computed, and modern implementations like Flash Attention do not materialize the full attention matrix, making past scores inaccessible too. To address these limitations, we introduce $\textit{Expected Attention}$, a training-free method that estimates a KV pair's importance by approximating how future queries will attend to it. By leveraging the distributional properties of activations in LLMs, we compute the expected attention score in closed form for each KV pair. This score is then used to rank and prune KV pairs with the smallest impact on the residual stream, achieving compression without performance loss. Crucially, our approach works in both prefilling and decoding tasks, consistently outperforming state-of-the-art baselines in both scenarios. We release all our code to enable researchers to implement and build upon our methods.
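The following is a minimal, hedged sketch of the idea described in the abstract, not the authors' released implementation. It assumes future queries are modeled as a Gaussian whose moments are estimated from recently observed queries, and it scores each cached key by the closed-form expectation of the unnormalized attention weight (ignoring the softmax denominator and any value-norm weighting the paper may use). The function names, the `keep_ratio` parameter, and the moment-estimation window are illustrative assumptions.

```python
import torch


def expected_attention_scores(keys: torch.Tensor,
                              query_mean: torch.Tensor,
                              query_cov: torch.Tensor) -> torch.Tensor:
    """Score each key by the log of the expected unnormalized attention weight.

    keys: (num_tokens, head_dim); query_mean: (head_dim,); query_cov: (head_dim, head_dim).
    For q ~ N(mu, Sigma), the Gaussian MGF gives
        E[exp(q.k / sqrt(d))] = exp(mu.k / sqrt(d) + k^T Sigma k / (2d)),
    which serves as a proxy for how strongly future queries will attend to k.
    """
    d = keys.shape[-1]
    scale = d ** 0.5
    mean_term = keys @ query_mean / scale                               # (num_tokens,)
    var_term = torch.einsum("nd,de,ne->n", keys, query_cov, keys) / (2 * d)
    return mean_term + var_term


def prune_kv_cache(keys, values, query_mean, query_cov, keep_ratio=0.5):
    """Rank KV pairs by expected attention and drop the lowest-scoring ones."""
    n = keys.shape[0]
    scores = expected_attention_scores(keys, query_mean, query_cov)
    k = max(1, int(keep_ratio * n))
    idx = torch.topk(scores, k).indices.sort().values                    # keep original order
    return keys[idx], values[idx]


# Toy usage: estimate query statistics from a window of recent queries (assumption).
torch.manual_seed(0)
d, n = 64, 128
keys, values = torch.randn(n, d), torch.randn(n, d)
recent_queries = torch.randn(32, d)
mu = recent_queries.mean(dim=0)
sigma = torch.cov(recent_queries.T)
pruned_k, pruned_v = prune_kv_cache(keys, values, mu, sigma, keep_ratio=0.25)
print(pruned_k.shape, pruned_v.shape)  # torch.Size([32, 64]) torch.Size([32, 64])
```

In this sketch the ranking is per attention head; applying it across heads and layers, and deciding how aggressively to prune each, are design choices the paper addresses but that are left out here.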
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16706