Keywords: Large Language Models, KV Cache, Machine Learning, Efficient Machine Learning
TL;DR: This work proposes a plug-and-play, data-level KV cache eviction method that improves system efficiency in question-independent KV cache compression scenarios.
Abstract: Key-Value (KV) caching is a widely adopted technique in large language models (LLMs) to accelerate long-context inference. Recent studies predominantly focus on question-dependent KV cache eviction, where cache entries are evicted based on known queries. In this paper, however, we observe that these approaches often fail in question-independent scenarios, such as multi-turn dialogues and chunk pre-caching in retrieval-augmented generation (RAG), where future queries remain unknown. Our empirical analysis reveals that most existing KV cache eviction methods underperform in this setting due to their heavy reliance on importance metrics derived from attention scores with question tokens. The core challenge is to obtain a well-founded estimate of token importance without access to future questions. To address this, we propose OracleKV for question-independent KV cache eviction. OracleKV steers the model's attention with an oracle guidance that encodes surface-level statistics of user preferences drawn from large-scale real-world dialogues. Unlike existing methods, OracleKV operates at the data level, allowing seamless integration with other eviction algorithms in a plug-and-play manner. Experiments on several multi-turn and single-turn benchmarks demonstrate that OracleKV achieves a better accuracy-latency tradeoff than existing KV cache compression approaches. We hope our approach will expand the design space and serve as a solid baseline for future research in KV cache compression.
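To make the data-level, plug-and-play idea concrete, below is a minimal sketch of how a guidance-conditioned importance score might feed a generic score-based eviction routine. The function names, the simulated attention scores, and the pooling choice are all illustrative assumptions, not OracleKV's actual algorithm or guidance construction.

```python
# Illustrative sketch only: data-level guidance for question-independent
# KV cache eviction. Names and simulated scores are assumptions, not the
# paper's method; a real system would read attention from the model itself.
import numpy as np

def importance_scores(context_len: int, guidance_len: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Stand-in for per-token importance conditioned on guidance tokens.

    In practice these would be attention scores from guidance tokens
    (appended at the data level, in place of an unknown future question)
    onto the cached context tokens; here we draw random values so the
    example runs without a model.
    """
    # Rows: guidance (query) tokens, columns: cached context tokens.
    attn = rng.random((guidance_len, context_len))
    attn /= attn.sum(axis=-1, keepdims=True)   # normalize like softmax output
    return attn.max(axis=0)                    # pool over guidance tokens

def evict_kv(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Generic score-based eviction: retain the top `keep_ratio` fraction."""
    keep = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-keep:])  # indices of retained KV entries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    context_len, guidance_len = 4096, 32        # cached tokens vs. guidance tokens
    scores = importance_scores(context_len, guidance_len, rng)
    kept = evict_kv(scores, keep_ratio=0.25)
    print(f"retained {len(kept)}/{context_len} KV entries")
```

Because the guidance only changes what the importance scores are conditioned on, any existing score-based eviction algorithm could, in principle, be slotted in for `evict_kv` unchanged.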
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15136