Keywords: LLM Efficiency, Key-Value Cache Compression, Long-Context LLM, Inference Optimization
TL;DR: We propose a novel method that augments the LLM with parameter-efficient modules to perform fast and accurate KV cache eviction by predicting the attention pattern of the model's future response.
Abstract: Transformer-based large language models (LLMs) rely on key–value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV entries that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work improves eviction quality by “glimpsing into the future”: a low-cost draft generator first produces a surrogate response that mimics the target model's true response, which is subsequently used to estimate the importance scores of cached KV entries. In this paper, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of a surrogate future response without the need for costly draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines on long-context understanding tasks by $25$\%, but also reduces the eviction cost by up to $14.5\times$, leading to significantly faster time-to-first-token (TTFT).
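To illustrate the general idea of importance-guided KV cache eviction described in the abstract, here is a minimal, hypothetical sketch (not the authors' implementation): a small learned scorer assigns an importance score to each cached key, and only the top-scoring fraction of KV entries is retained. The module and parameter names (`ImportanceScorer`, `keep_ratio`) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of importance-score-based KV cache eviction.
# Not the LookaheadKV implementation; names and shapes are assumptions.
import torch
import torch.nn as nn


class ImportanceScorer(nn.Module):
    """Lightweight head that predicts a per-token importance score from cached keys."""

    def __init__(self, head_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(head_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: [batch, seq_len, head_dim] -> scores: [batch, seq_len]
        return self.net(keys).squeeze(-1)


def evict_kv(keys, values, scorer, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of cached KV entries."""
    scores = scorer(keys)                                      # [B, T]
    k = max(1, int(keys.size(1) * keep_ratio))
    top_idx = scores.topk(k, dim=-1).indices.sort(-1).values   # keep original token order
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, keys.size(-1))
    return keys.gather(1, gather_idx), values.gather(1, gather_idx)


if __name__ == "__main__":
    B, T, D = 2, 128, 64
    keys, values = torch.randn(B, T, D), torch.randn(B, T, D)
    scorer = ImportanceScorer(D)
    k_kept, v_kept = evict_kv(keys, values, scorer, keep_ratio=0.25)
    print(k_kept.shape)  # torch.Size([2, 32, 64])
```

In this toy setup the scorer plays the role of the paper's parameter-efficient predictor: it replaces both costly draft generation and attention-based heuristics with a single cheap forward pass over the cached keys before eviction.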
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17724