KVpop - Retrofitting LLMs with xLSTM-guided Token Eviction
Abstract: Key-value (KV) cache growth is a major bottleneck in autoregressive decoding, as memory and bandwidth scale linearly with the context length.
Existing KV-eviction methods often rely on static heuristics or make retention decisions early, before downstream context is available, which leads to brittle eviction as token relevance shifts.
To address this, we introduce KVpop, which uses a stateful xLSTM scorer to perform delayed importance scoring at eviction time, enabling context-aware cache management under a fixed per-head budget.
KVpop is trained with a novel future-attention importance target that estimates long-term token utility without materializing dense attention.
We show that KVpop preserves dense attention performance on challenging mathematical reasoning tasks, while reducing KV cache size by 75%. Even at ~94% KV compression, KVpop still retains ~80% of performance, almost doubling baseline recovery.
Our results indicate that timing-aware eviction cuts KV memory costs while maintaining quality.
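The core mechanism (scoring resident tokens at eviction time, under a fixed per-head budget) can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: the class name, the `score_fn` interface, and the eviction loop are all hypothetical, and in KVpop a trained xLSTM scorer would replace the placeholder scoring function.

```python
class EvictingKVCache:
    """Toy fixed-budget KV cache for one attention head.

    On overflow, every resident token is (re-)scored at eviction time,
    so the decision reflects context accumulated since insertion --
    the "delayed" part of delayed importance scoring.
    """

    def __init__(self, budget, score_fn):
        self.budget = budget        # max cached entries for this head
        self.score_fn = score_fn    # (position, (key, value)) -> importance
        self.cache = {}             # position -> (key, value); dicts keep insertion order

    def insert(self, pos, key, value):
        self.cache[pos] = (key, value)
        if len(self.cache) > self.budget:
            # Delayed decision: score all resident tokens now, at
            # eviction time, rather than when they were first inserted.
            victim = min(self.cache, key=lambda p: self.score_fn(p, self.cache[p]))
            del self.cache[victim]


# Toy usage: treat the stored value as the importance estimate and
# evict the lowest-scoring token once the budget is exceeded.
cache = EvictingKVCache(budget=2, score_fn=lambda p, kv: kv[1])
cache.insert(0, "k0", 5.0)
cache.insert(1, "k1", 1.0)
cache.insert(2, "k2", 3.0)   # over budget: position 1 (score 1.0) is evicted
```

The key design point the sketch illustrates is *when* scoring happens: a static heuristic would fix each token's fate at insertion, whereas eviction-time scoring lets later context change which tokens are deemed expendable.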
Submission Number: 64