Keywords: KV-cache compression, Heavy-Hitter Oracle, autonomous agents, memory efficiency, transformer optimization, VRAM management, sparse attention, foundation models
TL;DR: Proposes $\ell_2$-norm KV-cache eviction with no attention-score tracking. Matches the full cache at budgets $\ge$ 512 tokens (where eviction never triggers on our sequence lengths), but underperforms a sliding window at 256 tokens: recency dominates at tight budgets.
Abstract: Large language models deployed as autonomous agents face a fundamental memory
constraint: the KV-cache required for autoregressive generation grows
linearly with context length. We propose \textbf{$\ell_2$-Norm Eviction},
a novel gradient-free KV-cache compression method that scores tokens by the
mean $\ell_2$-norm of their key vectors across attention heads, retaining a
hybrid of high-norm heavy hitters and recent tokens. Unlike H2O~\cite{h2o},
which requires accumulating explicit attention scores across all decoding steps,
our method operates with a single pass over key tensors and imposes no
attention-tracking overhead. We evaluate $\ell_2$-Norm Eviction against a
full-cache baseline and a StreamingLLM-style sliding window on the GSM8K
mathematical reasoning benchmark and curated logic prompts, using automated
Exact Match scoring across four cache budgets (256--2048 tokens) on
Mistral-7B-Instruct-v0.3. At budgets 512--2048, the eviction condition ($T > B$,
for sequence length $T$ and cache budget $B$) is never satisfied because total
sequence lengths in our evaluation set remain below 512 tokens;
no tokens are dropped and all methods match the full-cache baseline exactly.
At the extreme budget of 256 tokens (an 87.5\% reduction from the 2048-token
maximum), where eviction does fire,
the sliding window (EM=0.25) outperforms $\ell_2$-Norm Eviction
(EM=0.05) on GSM8K, indicating that recency dominates global token
importance at very tight budgets. We characterise this as a minimum viable
budget effect and identify adaptive pool sizing as the key direction for
closing this gap.
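
The following is a minimal sketch of the scoring and retention step described above, written in PyTorch. The tensor layout, function name, and fixed 32-token recency pool are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of l2-norm KV-cache eviction as described in the
# abstract; the [num_heads, seq_len, head_dim] layout and the 32-token
# recency pool are assumptions, not the authors' code.
import torch

def l2_norm_eviction(keys: torch.Tensor, budget: int,
                     recent: int = 32) -> torch.Tensor:
    """Return the indices of cache positions to keep for one layer.

    keys:   [num_heads, seq_len, head_dim] key tensor.
    budget: total number of tokens to retain (B).
    recent: size of the recency pool; the remaining budget is filled
            with high-norm "heavy hitter" tokens.
    """
    assert budget > recent, "budget must leave room for heavy hitters"
    seq_len = keys.shape[1]
    if seq_len <= budget:             # eviction condition T > B not met:
        return torch.arange(seq_len)  # keep everything

    # Score each token by the mean l2-norm of its key vector across heads
    # (a single pass over the key tensor; no attention tracking needed).
    scores = keys.norm(dim=-1).mean(dim=0)        # [seq_len]

    # Always keep the most recent `recent` tokens.
    recent_idx = torch.arange(seq_len - recent, seq_len)

    # Exclude the recency pool, then take the highest-norm older tokens.
    scores[recent_idx] = float("-inf")
    heavy_idx = scores.topk(budget - recent).indices

    return torch.sort(torch.cat([heavy_idx, recent_idx])).values
```

Under these assumptions, a budget of 256 with a 32-token recency pool keeps 224 heavy hitters; the adaptive pool sizing identified above as future work would amount to varying `recent` with the budget.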
Submission Number: 101