Don’t Discard, but Keep It Small: Context-Preserving KV Cache Compression with Importance-Aware Adaptive Precision
Keywords: large language models, safety, hallucination, key-value cache compression, long context
Abstract: As the length of input sequences in Large Language Models (LLMs) continues to grow, efficient key-value (KV) cache management has become essential for improving inference speed and throughput of autoregressive decoding.
Although several approaches have been proposed to reduce memory usage by selectively retaining only the important KV pairs and discarding the rest, these eviction-based methods can lead to unintended consequences during the generation process.
In this paper, we investigate the adverse effects of cache eviction methods and reveal that discarding KV pairs can introduce risks such as breaches of safety prompts, hallucinations, and loss of critical contextual information.
Interestingly, we find that preserving even a fraction of the information from evicted KV pairs through reduced precision quantization significantly mitigates these issues.
On the other hand, we also observe that important KV pairs need to be maintained at higher precision to preserve generation quality.
Based on these findings, we propose Mixed-precision KV cache (MiKV), a robust plug-and-play cache compression method that balances performance and memory efficiency.
MiKV preserves lost contextual information by storing evicted KV pairs in low precision, while maintaining the essential KV pairs in higher precision to ensure generation quality.
Experimental results across multiple benchmarks and LLM architectures demonstrate that our method achieves a state-of-the-art balance between compression ratio and model performance, outperforming existing baselines.
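To make the idea concrete, below is a minimal NumPy sketch of importance-aware mixed-precision KV caching under the assumptions that per-token importance comes from accumulated attention mass and that the low-precision path is a simple per-token asymmetric 4-bit quantizer. All names (quantize_int4, compress_kv, etc.), the importance criterion, and the quantization scheme are illustrative assumptions for exposition, not the paper's MiKV implementation.

```python
# Sketch: keep high-importance KV pairs in full precision, store the rest in
# low precision (instead of evicting them), then rebuild a dense view for
# attention. Hypothetical helper names; not the authors' code.
import numpy as np


def quantize_int4(x: np.ndarray):
    """Asymmetric 4-bit quantization along the last axis (per cached token)."""
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    scale = (x_max - x_min) / 15.0 + 1e-8            # 4 bits -> 16 levels
    q = np.clip(np.round((x - x_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, x_min


def dequantize_int4(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min


def compress_kv(keys, values, attn_scores, keep_ratio=0.25):
    """Split the cache: top-k important tokens stay full precision,
    the remainder is stored in low precision rather than discarded.

    keys, values: [seq_len, head_dim]; attn_scores: [seq_len]
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    important = np.argsort(attn_scores)[-n_keep:]     # highest accumulated attention
    rest = np.setdiff1d(np.arange(seq_len), important)

    full = {"idx": important, "k": keys[important], "v": values[important]}
    low = {"idx": rest,
           "k": quantize_int4(keys[rest]),
           "v": quantize_int4(values[rest])}
    return full, low


def reconstruct_kv(full, low, seq_len, head_dim):
    """Rebuild dense K/V for attention by dequantizing the low-precision part."""
    keys = np.empty((seq_len, head_dim), dtype=np.float32)
    values = np.empty((seq_len, head_dim), dtype=np.float32)
    keys[full["idx"]] = full["k"]
    values[full["idx"]] = full["v"]
    keys[low["idx"]] = dequantize_int4(*low["k"])
    values[low["idx"]] = dequantize_int4(*low["v"])
    return keys, values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, head_dim = 128, 64
    K = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
    V = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
    scores = rng.random(seq_len)                      # stand-in importance scores

    full, low = compress_kv(K, V, scores, keep_ratio=0.25)
    K_hat, V_hat = reconstruct_kv(full, low, seq_len, head_dim)
    print("max key error on low-precision tokens:",
          np.abs(K - K_hat)[low["idx"]].max())
```

The point of the sketch is the design choice the abstract argues for: tokens that an eviction policy would drop still contribute approximate keys and values after dequantization, so their contextual information is degraded rather than lost, while the high-importance tokens remain exact.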
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7184