Abstract: Large language models (LLMs) are powerful but require massive memory to cache the key/value vectors (KV cache) for efficient inference.
To reduce this memory burden, we propose MAT, a novel KV cache eviction strategy tailored to the heterogeneous attention patterns observed in the shallow and deep layers of LLMs. Through a detailed analysis of attention patterns in LLMs, we observe that,
in deeper layers, anchor tokens, which consistently receive high attention logits from subsequent tokens, exhibit notably low attention logits toward one another.
This observation motivates us, in deep layers, to prioritize retaining anchor tokens identified by their attention logits to the first token.
For shallow layers, we retain the first few input tokens together with a sliding window of recent tokens to preserve local context.
Extensive experiments on end-to-end, language modeling, and open-ended generation tasks demonstrate that MAT outperforms existing methods under the same memory budgets.
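
The following is a minimal sketch of the layer-dependent eviction policy described in the abstract, not the authors' implementation. It assumes a per-layer cache budget `budget`, a depth threshold `shallow_depth` separating shallow from deep layers, and access to each cached token's attention logit toward the first token; the names `select_kv_indices`, `attn_logits_to_first`, and `n_sink` are illustrative assumptions.

```python
import numpy as np

def select_kv_indices(layer_idx, seq_len, budget, attn_logits_to_first,
                      shallow_depth=2, n_sink=4):
    """Return indices of KV-cache entries to retain for one layer (sketch)."""
    if seq_len <= budget:
        return np.arange(seq_len)

    # Always keep the first few tokens of the input.
    sink = np.arange(min(n_sink, seq_len))

    if layer_idx < shallow_depth:
        # Shallow layers: initial tokens plus a sliding window of recent tokens.
        window = np.arange(seq_len - (budget - len(sink)), seq_len)
        keep = np.union1d(sink, window)
    else:
        # Deep layers: rank the remaining tokens by their attention logit to the
        # first token and keep the highest-scoring ones as anchor tokens.
        candidates = np.arange(len(sink), seq_len)
        scores = attn_logits_to_first[candidates]
        top = candidates[np.argsort(scores)[::-1][: budget - len(sink)]]
        keep = np.union1d(sink, top)

    return np.sort(keep)
```

In this sketch, the returned indices would be used to gather the retained key/value vectors for that layer before the next decoding step; how MAT actually obtains and normalizes the attention logits to the first token is specified in the paper, not here.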
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: LLM Efficiency; NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Keywords: Efficient Inference; KV Cache Compression
Submission Number: 5721