KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: Large Language Model, KV Cache Compression, Efficient Inference
TL;DR: We propose a novel query-agnostic KV cache eviction method for multi-query scenarios.
Abstract: Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. Longer contexts increase the KV cache size, leading to significant memory overhead and higher attention computation latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed context caches across different queries. KVzip quantifies the importance of each KV pair by using the underlying LLM to reconstruct the original context from the cached KV pairs, and subsequently evicts pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and decreases FlashAttention latency by approximately $2\times$, without performance degradation in question-answering, retrieval, mathematical reasoning, and code comprehension tasks. Evaluations include state-of-the-art models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing KV eviction methods, which suffer performance losses even at a 90\% cache budget ratio under multi-query scenarios. Code is available at https://github.com/snu-mllab/KVzip.
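The abstract sketches KVzip's core idea: score each cached KV pair by how much the LLM relies on it when reconstructing the original context, then evict the lowest-scoring pairs. The snippet below is a minimal, hypothetical sketch of that scoring-and-eviction loop, not the authors' implementation; the function names, tensor shapes, and per-head top-k retention rule are illustrative assumptions, and the attention weights are assumed to have already been recorded during a reconstruction pass.

```python
# Minimal sketch (assumptions, not the KVzip codebase) of query-agnostic,
# reconstruction-based KV importance scoring followed by top-k retention.
import torch


def reconstruction_importance(attn: torch.Tensor) -> torch.Tensor:
    """attn: [num_heads, num_recon_steps, num_ctx_tokens] attention weights
    recorded while the LLM regenerates the context from its KV cache.
    Returns per-(head, token) importance: the maximum attention each cached
    KV pair receives at any reconstruction step."""
    return attn.max(dim=1).values  # [num_heads, num_ctx_tokens]


def eviction_mask(importance: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of KV pairs per head; 0.3 roughly
    corresponds to the 3-4x cache reduction reported in the abstract."""
    num_heads, num_tokens = importance.shape
    k = max(1, int(keep_ratio * num_tokens))
    topk = importance.topk(k, dim=-1).indices
    mask = torch.zeros_like(importance, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    return mask  # True = retain this KV pair


# Toy usage: random attention weights stand in for a real reconstruction trace.
attn = torch.rand(8, 16, 128)  # 8 heads, 16 reconstruction steps, 128 context tokens
keep = eviction_mask(reconstruction_importance(attn), keep_ratio=0.3)
print(keep.float().mean())     # ~0.3 of the cache retained
```

Because the importance scores come from reconstructing the context itself rather than from any particular query, the same compressed cache can, in principle, be reused across different downstream queries.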
Submission Number: 23