CAOTE: Optimizing KV Cache Memory Through Attention Output Error-based Token Eviction

Published: 03 Mar 2026, Last Modified: 06 Mar 2026 · ICLR 2026 Workshop MemAgents · CC BY 4.0
Keywords: KV Cache Memory Optimization, Training-free memory controller, Attention output error
TL;DR: CAOTE is a closed‑form, near‑zero‑overhead token‑eviction criterion that preserves long‑context LLM performance by minimizing attention‑output error, delivering large accuracy and recall gains while keeping KV memory tightly bounded.
Abstract: Long‑context support in large language models (LLMs) amplifies memory and compute bottlenecks during inference, especially in resource‑constrained environments. A major contributor is the key–value (KV) cache, which grows linearly with sequence length and can exceed model size. Token eviction—removing less important tokens from the cache—is a widely adopted post‑training strategy, but existing methods rely solely on attention scores, ignoring the contribution of value vectors to the attention output. We introduce CAOTE, a closed‑form criterion that minimizes the change in attention output caused by eviction. CAOTE integrates attention weights and value vectors to compute an exact eviction error per token, and can act as a meta‑policy atop existing heuristics such as H2O, TOVA, and SNAPKV. Across LLaMA‑3 and Qwen‑2.5 models, CAOTE consistently improves accuracy on LongBench, reduces perplexity gaps, and boosts Needle‑in‑Haystack recall by up to 60% at tight budgets (2k–4k tokens). Theoretical analysis shows that CAOTE adds negligible overhead (\(<0.1\%\) of prefill FLOPs for 4k–32k contexts), and an efficient variant (FastCAOTE) achieves similar gains with further compute savings. By bounding KV memory while preserving task quality, CAOTE offers a practical, drop‑in solution for long‑context LLM serving.
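The abstract describes an exact, closed‑form eviction error that combines attention weights and value vectors. One way to realize such a quantity is to measure how much the softmax‑weighted attention output shifts when a single token is removed and the remaining weights are renormalized; the sketch below (a NumPy illustration, not the paper's exact formulation — the function name and the single‑query setup are assumptions) computes this error for every cached token in closed form:

```python
import numpy as np

def eviction_errors(attn: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Exact change in the attention output caused by evicting each token.

    attn:   (T,) softmax attention weights for one query (sums to 1)
    values: (T, d) value vectors
    Returns a (T,) array of error magnitudes ||o' - o||, where o' is the
    attention output after evicting token i and renormalizing the
    remaining weights.
    """
    out = attn @ values  # current attention output, shape (d,)
    # After evicting token i, the renormalized output is
    #   o' = (o - a_i v_i) / (1 - a_i),
    # so the shift is o' - o = (a_i / (1 - a_i)) * (o - v_i).
    coef = attn / (1.0 - attn)  # (T,)
    return coef * np.linalg.norm(out[None, :] - values, axis=1)
```

Under this formulation, evicting the token with the smallest error minimizes the perturbation to the attention output, which is how an attention‑score heuristic could be augmented with value‑vector information at negligible cost.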
Submission Number: 25