Keywords: LLM, KV cache, compression, long-context
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive contexts, but this ability comes with significant GPU memory costs, particularly in the key-value (KV) cache. Although recent KV cache compression methods show strong performance, they all retain or discard the KV cache at the level of individual tokens, which loses chunk-level semantic information. We introduce ChunkKV, a novel KV cache compression method that groups related tokens into chunks and retains the most informative semantic chunks while discarding the less important ones, thereby preserving semantic information. Furthermore, we observe that the indices of the retained KV cache are highly similar across layers, so we also propose a layer-wise index reuse technique to further reduce computational overhead. This technique not only improves compression efficiency but also provides insight into the similarities between layers within LLMs. We evaluate ChunkKV on long-context benchmarks, including LongBench and Needle-In-A-Haystack, as well as the GSM8K in-context learning benchmark. Our experiments with LLaMA-3-8B-Instruct, Mistral-7B-Instruct, and Qwen2-7B-Instruct demonstrate that ChunkKV outperforms other KV cache compression methods, even surpassing the full KV cache under the same conditions. At a compression ratio of 10%, ChunkKV achieves state-of-the-art performance across a range of tasks, indicating its effectiveness at preserving semantics and maintaining model performance for long-context and in-context LLM inference.
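The following is a minimal sketch of the chunk-based retention idea described in the abstract, not the authors' implementation: tokens are grouped into fixed-size chunks, each chunk is scored by an importance signal (here, a generic per-token attention score, which is an assumption), and only the top-scoring chunks' KV entries are kept. The chunk size, keep ratio, and scoring rule are illustrative placeholders; the returned indices hint at how layer-wise index reuse could work by applying the same indices to other layers.

```python
# Hypothetical sketch of chunk-based KV cache compression (not the paper's code).
import torch

def chunk_kv_compress(keys, values, attn_scores, chunk_size=10, keep_ratio=0.1):
    """
    keys, values : [seq_len, head_dim] KV cache entries for one head/layer.
    attn_scores  : [seq_len] importance score per cached token (assumed to come
                   from attention weights of recent queries).
    Returns compressed keys/values and the retained token indices; reusing the
    same indices for other layers illustrates layer-wise index reuse.
    """
    seq_len = keys.shape[0]
    num_chunks = (seq_len + chunk_size - 1) // chunk_size

    # Score each chunk by the total importance of its tokens.
    chunk_scores = torch.zeros(num_chunks)
    for c in range(num_chunks):
        chunk_scores[c] = attn_scores[c * chunk_size:(c + 1) * chunk_size].sum()

    # Keep the highest-scoring chunks so roughly keep_ratio of chunks survive.
    num_keep = max(1, int(num_chunks * keep_ratio))
    top_chunks = torch.topk(chunk_scores, num_keep).indices

    # Gather the token indices of the retained chunks, preserving order.
    keep_idx = torch.cat([
        torch.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in sorted(top_chunks.tolist())
    ])
    return keys[keep_idx], values[keep_idx], keep_idx


# Toy usage: 200 cached tokens, 64-dim heads, random importance scores.
k, v = torch.randn(200, 64), torch.randn(200, 64)
scores = torch.rand(200)
k_c, v_c, idx = chunk_kv_compress(k, v, scores)
print(k_c.shape, idx[:10])  # compressed cache and reusable indices
```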
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14025