Keywords: LLM, KV cache, compression, long-context
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive contexts, but this ability comes with significant GPU memory costs, particularly in the key-value (KV) cache. Although recent KV cache compression methods show strong performance, they all retain or discard the KV cache at the level of individual tokens, which loses chunk-level semantic information. We introduce ChunkKV, a novel KV cache compression method that groups related tokens into chunks and retains the most informative semantic chunks while discarding the less important ones, thereby preserving semantic information. Furthermore, we observe that the indices of the retained KV cache are highly similar across layers, so we also propose a layer-wise index reuse technique to further reduce computational overhead. This technique not only improves compression efficiency but also provides insight into the similarities between layers within LLMs. We evaluate ChunkKV on long-context benchmarks, including LongBench and Needle-In-A-Haystack, as well as the GSM8K in-context learning benchmark. Our experiments with LLaMA-3-8B-Instruct, Mistral-7B-Instruct, and Qwen2-7B-Instruct demonstrate that ChunkKV outperforms other KV cache compression methods, even surpassing the full KV cache under the same conditions. At a compression ratio of 10%, ChunkKV achieves state-of-the-art performance across a range of tasks, indicating its effectiveness at preserving semantics and maintaining model performance for long-context and in-context LLM inference.
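The following is a minimal sketch of the chunk-based retention idea described in the abstract, not the authors' implementation: tokens are grouped into fixed-size chunks, each chunk is scored by an importance signal (here, a generic per-token attention score, which is an assumption), and only the top-scoring chunks' KV entries are kept. The chunk size, keep ratio, and scoring rule are illustrative placeholders; the returned indices hint at how layer-wise index reuse could work by applying the same indices to other layers.

```python
# Hypothetical sketch of chunk-based KV cache compression (not the paper's code).
import torch

def chunk_kv_compress(keys, values, attn_scores, chunk_size=10, keep_ratio=0.1):
    """
    keys, values : [seq_len, head_dim] KV cache entries for one head/layer.
    attn_scores  : [seq_len] importance score per cached token (assumed to come
                   from attention weights of recent queries).
    Returns compressed keys/values and the retained token indices; reusing the
    same indices for other layers illustrates layer-wise index reuse.
    """
    seq_len = keys.shape[0]
    num_chunks = (seq_len + chunk_size - 1) // chunk_size

    # Score each chunk by the total importance of its tokens.
    chunk_scores = torch.zeros(num_chunks)
    for c in range(num_chunks):
        chunk_scores[c] = attn_scores[c * chunk_size:(c + 1) * chunk_size].sum()

    # Keep the highest-scoring chunks so roughly keep_ratio of chunks survive.
    num_keep = max(1, int(num_chunks * keep_ratio))
    top_chunks = torch.topk(chunk_scores, num_keep).indices

    # Gather the token indices of the retained chunks, preserving order.
    keep_idx = torch.cat([
        torch.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in sorted(top_chunks.tolist())
    ])
    return keys[keep_idx], values[keep_idx], keep_idx


# Toy usage: 200 cached tokens, 64-dim heads, random importance scores.
k, v = torch.randn(200, 64), torch.randn(200, 64)
scores = torch.rand(200)
k_c, v_c, idx = chunk_kv_compress(k, v, scores)
print(k_c.shape, idx[:10])  # compressed cache and reusable indices
```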
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14025