Abstract: Large language models (LLMs) are powerful but require massive memory to cache the key/value vectors (KV cache) for efficient inference.
To reduce this memory burden, we propose MAT, a novel KV cache eviction strategy tailored to the heterogeneous attention patterns observed in the shallow and deep layers of LLMs. Through a detailed analysis of attention patterns in LLMs, we observe that,
in deeper layers, anchor tokens, which consistently receive high attention logits from subsequent tokens, exhibit notably low attention logits toward one another.
This observation motivates us, in deep layers, to prioritize retaining anchor tokens identified by their attention logits to the first token.
For shallow layers, we retain the first few input tokens together with a sliding window of recent tokens to preserve local context.
Extensive experiments on end-to-end, language modeling, and open-ended generation tasks demonstrate that MAT outperforms existing methods under the same memory budgets.
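
The following is a minimal sketch of the layer-dependent eviction policy described in the abstract, not the authors' implementation. It assumes a per-layer cache budget `budget`, a depth threshold `shallow_depth` separating shallow from deep layers, and access to each cached token's attention logit toward the first token; the names `select_kv_indices`, `attn_logits_to_first`, and `n_sink` are illustrative assumptions.

```python
import numpy as np

def select_kv_indices(layer_idx, seq_len, budget, attn_logits_to_first,
                      shallow_depth=2, n_sink=4):
    """Return indices of KV-cache entries to retain for one layer (sketch)."""
    if seq_len <= budget:
        return np.arange(seq_len)

    # Always keep the first few tokens of the input.
    sink = np.arange(min(n_sink, seq_len))

    if layer_idx < shallow_depth:
        # Shallow layers: initial tokens plus a sliding window of recent tokens.
        window = np.arange(seq_len - (budget - len(sink)), seq_len)
        keep = np.union1d(sink, window)
    else:
        # Deep layers: rank the remaining tokens by their attention logit to the
        # first token and keep the highest-scoring ones as anchor tokens.
        candidates = np.arange(len(sink), seq_len)
        scores = attn_logits_to_first[candidates]
        top = candidates[np.argsort(scores)[::-1][: budget - len(sink)]]
        keep = np.union1d(sink, top)

    return np.sort(keep)
```

In this sketch, the returned indices would be used to gather the retained key/value vectors for that layer before the next decoding step; how MAT actually obtains and normalizes the attention logits to the first token is specified in the paper, not here.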
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: LLM Efficiency; NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Keywords: Efficient Inference; KV Cache Compression
Submission Number: 5721