RACC: Retrieval-Augmented KV Cache Compression in Long-Context Generation

ICLR 2026 Conference Submission 15530 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Efficient Inference, KV Cache Compression, Vector Retrieval
Abstract: Large Language Models (LLMs) have achieved remarkable progress in long-context generation. As the context length increases, the Key--Value (KV) cache consumes GPU memory that grows linearly with the context length. KV cache compression is regarded as a promising way to reduce memory usage by permanently discarding a large portion of unimportant KV pairs, but it comes at the expense of inference accuracy. On the other hand, retrieval-based methods store the full KV cache in CPU memory and compute attention via expensive CPU-GPU I/O, which preserves accuracy but suffers from high inference latency. To address these issues, we propose a new inference framework called RACC, which combines compression-based and retrieval-based methods. Specifically, we employ a KV cache compression method to maintain a high-quality KV cache in GPU memory, while storing all KV pairs evicted by the compression method in CPU memory. In addition, efficient and accurate retrieval on the CPU side identifies the KV pairs most important to the token being generated, which are then concatenated with the KV cache held in GPU memory for accurate generation. Extensive experiments demonstrate that RACC achieves near-lossless inference while using only 15\% of the original KV cache. Moreover, its combination with prefill-only compression methods improves generation accuracy by 3--10\%. Our code is publicly available at \url{https://anonymous.4open.science/r/CDKEY/}.
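To make the abstract's workflow concrete, below is a minimal sketch (not the authors' implementation) of the idea it describes: a compressed KV cache kept on the GPU, a store of evicted KV pairs kept on the CPU, and per-step retrieval of the evicted entries most relevant to the current query before attention. All names (`EvictedKVStore`, `racc_attention`, `top_k`) and the dot-product retrieval scoring are illustrative assumptions, and plain NumPy arrays stand in for GPU/CPU tensors.

```python
# Hypothetical single-head sketch of compression + retrieval, as described in the abstract.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class EvictedKVStore:
    """CPU-side store of KV pairs dropped by the compression policy (assumed design)."""

    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def add(self, keys, values):
        self.keys = np.concatenate([self.keys, keys], axis=0)
        self.values = np.concatenate([self.values, values], axis=0)

    def retrieve(self, query, top_k):
        """Return the top_k evicted KV pairs most similar to the query."""
        if len(self.keys) == 0:
            return self.keys, self.values
        scores = self.keys @ query               # dot-product relevance (assumption)
        idx = np.argsort(-scores)[:top_k]
        return self.keys[idx], self.values[idx]


def racc_attention(query, gpu_keys, gpu_values, store, top_k=32):
    """Attention over the GPU-resident compressed cache plus retrieved evicted KVs."""
    r_keys, r_values = store.retrieve(query, top_k)
    keys = np.concatenate([gpu_keys, r_keys], axis=0)
    values = np.concatenate([gpu_values, r_values], axis=0)
    weights = softmax(keys @ query / np.sqrt(query.shape[-1]))
    return weights @ values


if __name__ == "__main__":
    d = 64
    rng = np.random.default_rng(0)
    store = EvictedKVStore(d)
    store.add(rng.normal(size=(1000, d)), rng.normal(size=(1000, d)))   # evicted tokens on "CPU"
    gpu_k, gpu_v = rng.normal(size=(150, d)), rng.normal(size=(150, d)) # ~15% of KVs kept on "GPU"
    out = racc_attention(rng.normal(size=d), gpu_k, gpu_v, store, top_k=32)
    print(out.shape)  # (64,)
```

The sketch only illustrates the data flow; it omits multi-head batching, the specific compression and retrieval algorithms, and the CPU-GPU transfer details that the paper itself addresses.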
Primary Area: generative models
Submission Number: 15530