Norm-Guided KV-Cache Eviction for Memory-Efficient Reasoning

Published: 03 Mar 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop MemAgents · CC BY 4.0
Keywords: KV-cache compression, Heavy-Hitter Oracle, autonomous agents, memory efficiency, transformer optimization, VRAM management, sparse attention, foundation models
TL;DR: Proposes $\ell_2$-norm KV-cache eviction (no attention tracking). Matches full cache at $\ge$ 512 tokens, but at 256 tokens underperforms sliding window—recency dominates at tight budgets.
Abstract: Large language models deployed as autonomous agents face a fundamental memory constraint: the KV-cache required for autoregressive generation grows linearly with context length, quickly dominating VRAM in long-horizon settings. We propose \textbf{$\ell_2$-Norm Eviction}, a gradient-free KV-cache compression method that scores tokens by the mean $\ell_2$-norm of their key vectors across attention heads, retaining a hybrid of high-norm heavy hitters and recent tokens. Unlike H2O~\cite{h2o}, which requires accumulating explicit attention scores across all decoding steps, our method operates with a single pass over key tensors and imposes no attention-tracking overhead. We evaluate $\ell_2$-Norm Eviction against a full-cache baseline and a StreamingLLM-style sliding window on the GSM8K mathematical reasoning benchmark and curated logic prompts, using automated Exact Match scoring across four cache budgets (256--2048 tokens) on Mistral-7B-Instruct-v0.3. At budgets 512--2048, the eviction condition ($T > B$) is never satisfied because total sequence lengths remain below 512 tokens in our evaluation set; no tokens are dropped and all methods match the full-cache baseline exactly. At the extreme budget of 256 (87.5\% reduction), where eviction does fire, the sliding window (EM=0.25) outperforms $\ell_2$-Norm Eviction (EM=0.05) on GSM8K, indicating that recency dominates global token importance at very tight budgets. We characterise this as a minimum viable budget effect and identify adaptive pool sizing as the key direction for closing this gap.
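The selection rule described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the `recent` pool size, and the tensor layout `(num_heads, seq_len, head_dim)` are assumptions for the example. It shows the single pass over key tensors (mean $\ell_2$-norm across heads) and the hybrid retention of heavy hitters plus a recency pool, with eviction firing only when the sequence length $T$ exceeds the budget $B$.

```python
import numpy as np

def l2_norm_evict(keys: np.ndarray, budget: int, recent: int = 32) -> np.ndarray:
    """Return sorted indices of KV-cache positions to retain.

    keys   : key tensor of shape (num_heads, seq_len, head_dim)  [assumed layout]
    budget : total number of token positions to keep (B)
    recent : size of the always-kept recency pool  [illustrative default]
    """
    num_heads, seq_len, _ = keys.shape
    # Eviction condition T > B: below the budget, keep everything.
    if seq_len <= budget:
        return np.arange(seq_len)
    # Single pass over key tensors: per-position l2-norm, averaged over heads.
    scores = np.linalg.norm(keys, axis=-1).mean(axis=0)      # shape (seq_len,)
    # The most recent `recent` positions are always retained.
    recent_idx = np.arange(seq_len - recent, seq_len)
    # Exclude the recency pool from heavy-hitter selection.
    scores = scores.copy()
    scores[recent_idx] = -np.inf
    # Fill the remaining budget with the highest-norm "heavy hitter" tokens.
    n_heavy = budget - recent
    heavy_idx = np.argsort(scores)[-n_heavy:]
    return np.sort(np.concatenate([heavy_idx, recent_idx]))
```

Because scoring reads only the key tensor already resident in the cache, no per-step attention statistics need to be accumulated, which is the stated contrast with H2O.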
Submission Number: 101