Key-Value (KV) caching is a widely adopted technique in large language models (LLMs) to accelerate long-context inference. While recent studies predominantly focus on question-dependent KV cache eviction, where cache entries are evicted based on known queries, we observe that these approaches often fail in question-independent scenarios, such as multi-turn dialogues and chunk pre-caching in retrieval-augmented generation (RAG), where future queries remain unknown. Our empirical analysis reveals that most existing KV cache eviction methods underperform in this setting due to their heavy reliance on importance metrics derived from question tokens. The core challenge is to produce well-founded estimates of token importance without access to future questions. To address this, we propose OracleKV, a method for question-independent KV cache eviction. OracleKV steers the model's attention with oracle guidance that captures surface-level statistics of user preferences drawn from large-scale real-world dialogues. Unlike existing methods, OracleKV operates at the data level, allowing seamless integration with other eviction algorithms in a plug-and-play manner. We evaluate OracleKV on both multi-turn and single-turn benchmarks, demonstrating its efficiency and effectiveness. Furthermore, we reveal the significant potential of data-level intervention in KV cache compression, expanding the design space for future research.
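To make the data-level, plug-and-play idea concrete, here is a minimal sketch, not the paper's implementation: all names, shapes, and the scoring heuristic below are our own assumptions. It illustrates question-independent eviction where attention paid by appended oracle-guidance tokens serves as a query-free importance score that any downstream eviction algorithm could consume.

```python
# Hypothetical sketch of data-level, question-independent KV eviction.
# Assumption: attention weights over the context are available, and the
# oracle-guidance tokens occupy the final rows of the attention matrix.
import numpy as np

def importance_from_oracle(attn: np.ndarray, oracle_rows: slice) -> np.ndarray:
    """Average attention that the (hypothetical) oracle-guidance tokens
    pay to each context token; a question-free importance proxy."""
    return attn[oracle_rows].mean(axis=0)

def evict(keys: np.ndarray, values: np.ndarray,
          scores: np.ndarray, budget: int):
    """Keep the `budget` highest-scoring KV entries. Any eviction
    algorithm could consume `scores` here, in plug-and-play fashion."""
    keep = np.argsort(scores)[-budget:]
    keep.sort()  # preserve positional order of retained entries
    return keys[keep], values[keep]

# Toy example: 8 tokens total, the last 2 are oracle-guidance tokens.
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn /= attn.sum(axis=1, keepdims=True)          # row-normalize weights
scores = importance_from_oracle(attn, slice(6, 8))[:6]
K, V = rng.random((6, 4)), rng.random((6, 4))    # 6 cached KV entries
K_small, V_small = evict(K, V, scores, budget=3)
print(K_small.shape)  # (3, 4): half of the cache retained
```

Because the guidance enters through the data rather than the scoring rule, swapping in a different eviction backend would only require replacing `evict` while reusing the same oracle-derived scores.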