Don’t Discard, but Keep It Small: Context-Preserving KV Cache Compression with Importance-Aware Adaptive Precision
Keywords: large language models, safety, hallucination, key-value cache compression, long context
Abstract: As the length of input sequences in Large Language Models (LLMs) continues to grow, efficient key-value (KV) cache management has become essential for improving inference speed and throughput of autoregressive decoding.
Although several approaches have been proposed to reduce memory usage by selectively retaining only the important KV pairs and discarding the rest, these eviction-based methods can lead to unintended consequences during the generation process.
In this paper, we investigate the adverse effects of cache eviction methods and reveal that discarding KV pairs can introduce risks such as breaches of safety prompts, hallucinations, and loss of critical contextual information.
Interestingly, we find that preserving even a fraction of the information from evicted KV pairs through reduced precision quantization significantly mitigates these issues.
On the other hand, we also observe that important KV pairs need to be maintained at higher precision to preserve generation quality.
Based on these findings, we propose Mixed-precision KV cache (MiKV), a robust plug-and-play cache compression method that balances performance and memory efficiency.
MiKV preserves lost contextual information by storing evicted KV pairs in low precision, while maintaining the essential KV pairs in higher precision to ensure generation quality.
Experimental results across multiple benchmarks and LLM architectures demonstrate that our method achieves a state-of-the-art balance between compression ratio and model performance, outperforming existing baselines.
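To make the idea concrete, below is a minimal NumPy sketch of importance-aware mixed-precision KV caching under the assumptions that per-token importance comes from accumulated attention mass and that the low-precision path is a simple per-token asymmetric 4-bit quantizer. All names (quantize_int4, compress_kv, etc.), the importance criterion, and the quantization scheme are illustrative assumptions for exposition, not the paper's MiKV implementation.

```python
# Sketch: keep high-importance KV pairs in full precision, store the rest in
# low precision (instead of evicting them), then rebuild a dense view for
# attention. Hypothetical helper names; not the authors' code.
import numpy as np


def quantize_int4(x: np.ndarray):
    """Asymmetric 4-bit quantization along the last axis (per cached token)."""
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    scale = (x_max - x_min) / 15.0 + 1e-8            # 4 bits -> 16 levels
    q = np.clip(np.round((x - x_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, x_min


def dequantize_int4(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min


def compress_kv(keys, values, attn_scores, keep_ratio=0.25):
    """Split the cache: top-k important tokens stay full precision,
    the remainder is stored in low precision rather than discarded.

    keys, values: [seq_len, head_dim]; attn_scores: [seq_len]
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    important = np.argsort(attn_scores)[-n_keep:]     # highest accumulated attention
    rest = np.setdiff1d(np.arange(seq_len), important)

    full = {"idx": important, "k": keys[important], "v": values[important]}
    low = {"idx": rest,
           "k": quantize_int4(keys[rest]),
           "v": quantize_int4(values[rest])}
    return full, low


def reconstruct_kv(full, low, seq_len, head_dim):
    """Rebuild dense K/V for attention by dequantizing the low-precision part."""
    keys = np.empty((seq_len, head_dim), dtype=np.float32)
    values = np.empty((seq_len, head_dim), dtype=np.float32)
    keys[full["idx"]] = full["k"]
    values[full["idx"]] = full["v"]
    keys[low["idx"]] = dequantize_int4(*low["k"])
    values[low["idx"]] = dequantize_int4(*low["v"])
    return keys, values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, head_dim = 128, 64
    K = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
    V = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
    scores = rng.random(seq_len)                      # stand-in importance scores

    full, low = compress_kv(K, V, scores, keep_ratio=0.25)
    K_hat, V_hat = reconstruct_kv(full, low, seq_len, head_dim)
    print("max key error on low-precision tokens:",
          np.abs(K - K_hat)[low["idx"]].max())
```

The point of the sketch is the design choice the abstract argues for: tokens that an eviction policy would drop still contribute approximate keys and values after dequantization, so their contextual information is degraded rather than lost, while the high-importance tokens remain exact.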
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7184