Keywords: LLM Inference Acceleration; Long Context; KV Cache Compression
TL;DR: A head-specific KV cache retention method that preserves context for the heads that need it, maintaining multi-turn fidelity while reducing memory.
Abstract: Large Language Model (LLM) inference commonly requires caching all Key-Value (KV) states. This KV cache leads to substantial memory usage and increased latency in long-context settings. Existing KV cache compression methods reduce cache size by keeping only the tokens relevant to the current query, but they discard middle-context tokens needed by queries in later turns, harming multi-turn fidelity. We observe head specialization: a minority of attention heads are Context-Anchored (CA), attending preferentially to middle-context tokens, while most are locality heads that focus on sink tokens and recent tokens. This motivates ContextKeeper, a training-free, head-specific KV retention policy that preserves all middle-context tokens for CA heads and drops them for locality heads. Unlike prior head-splitting methods that require complex training procedures or deliver limited gains, our policy is derived by running inference on a small set of task samples and integrates as a plug-and-play inference strategy. ContextKeeper reduces KV cache size by up to 3.86× and lowers decoding latency by up to 1.25×, while introducing negligible accuracy loss compared to full attention across different models and 5-turn queries with up to 128K tokens. These results demonstrate a practical, scalable, query-agnostic KV compression method that preserves multi-turn fidelity under tight memory budgets in long-context deployment.
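To make the retention policy concrete, here is a minimal sketch of a head-specific KV retention scheme in the spirit of the abstract: CA heads keep the full cache, while locality heads keep only sink and recent tokens, and CA heads are flagged from attention mass on middle-context tokens over a few calibration samples. The function names, tensor layouts, window sizes, and threshold are illustrative assumptions, not the authors' implementation.

```python
import torch

def classify_ca_heads(attn_weights, num_sink=4, num_recent=512, threshold=0.3):
    """Flag Context-Anchored (CA) heads: heads whose attention mass on
    middle-context tokens exceeds a threshold, averaged over queries from
    a small set of calibration samples.

    attn_weights: [num_heads, q_len, seq_len] attention probabilities.
    Returns a boolean tensor [num_heads]; True marks a CA head.
    """
    seq_len = attn_weights.shape[-1]
    middle = attn_weights[..., num_sink:seq_len - num_recent]   # middle-context span
    middle_mass = middle.sum(dim=-1).mean(dim=-1)                # mean mass per head
    return middle_mass > threshold

def retain_kv_per_head(keys, values, ca_heads, num_sink=4, num_recent=512):
    """Head-specific KV retention: CA heads keep all tokens; locality heads
    keep only sink tokens plus the most recent window.

    keys, values: [num_heads, seq_len, head_dim]
    ca_heads:     boolean tensor [num_heads]
    Returns per-head (key, value) pairs, since retained lengths differ by head.
    """
    num_heads, seq_len, _ = keys.shape
    kept = []
    for h in range(num_heads):
        if ca_heads[h] or seq_len <= num_sink + num_recent:
            # CA head (or short context): preserve the middle-context tokens.
            kept.append((keys[h], values[h]))
        else:
            # Locality head: drop the middle, keep sinks + recent tokens.
            idx = torch.cat([
                torch.arange(num_sink),
                torch.arange(seq_len - num_recent, seq_len),
            ])
            kept.append((keys[h, idx], values[h, idx]))
    return kept
```

Because retention is decided once per head rather than per query, the same compressed cache serves later turns without recomputation, which is what makes the policy query-agnostic in this sketch.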
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18544