IntelLLM: Little Hints Make a Big Difference for LLM KV Cache Compression

27 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, KV cache compression, CGE, RGL
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in integrating contextual knowledge, but their deployment is often constrained by the substantial computational resources required for long text sequences. To mitigate the inference-time cost of the attention mechanism, LLMs employ key-value embedding caching (the KV cache), which introduces significant storage pressure. In this paper, we propose IntelLLM, a novel and efficient approach to KV cache compression that strikes a balance between compression rate and performance. Drawing inspiration from sparse attention mechanisms, we observe that only a small subset of tokens in lengthy texts captures the majority of the attention weight. This sparsity, intrinsic to the attention mechanism, serves as the foundation for improving the KV compression ratio through a strategic eviction method. IntelLLM comprises a center-of-gravity eviction (CGE) strategy and a remote gap localization (RGL) strategy. CGE addresses the potential loss of important semantic dependencies when evicting high-sparsity tokens: it prioritizes the retention of key tokens by shielding the centers of gravity of attention during inference, thereby preserving critical information and optimizing the efficiency of attention computation. RGL, inspired by advances in positional encoding research, leverages implicit positional features to maintain long-range dependencies. Our KV compression approach integrates seamlessly with existing LLMs, requiring minimal code modifications and no fine-tuning or changes to model parameters. IntelLLM not only significantly reduces the storage requirements of the KV cache but also consistently outperforms full-KV models on long-text processing tasks, while incurring only 50% of the usual KV cache cost.
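The abstract describes eviction of low-importance cached tokens based on attention sparsity. The sketch below illustrates the general idea of attention-score-based KV cache eviction; it is not the authors' CGE/RGL implementation. The function name `evict_kv_cache`, the 50% retention budget, and the recent-window heuristic are all illustrative assumptions introduced here for clarity.

```python
# Minimal sketch (assumed, not the paper's implementation) of sparsity-driven
# KV cache eviction: score each cached token by the attention mass it receives,
# keep the heaviest "centers of gravity" plus a recent window, evict the rest.
import torch


def evict_kv_cache(keys, values, attn_weights, keep_ratio=0.5, recent_window=32):
    """keys/values: [num_heads, seq_len, head_dim]; attn_weights: [num_heads, q_len, seq_len]."""
    seq_len = keys.shape[1]
    budget = max(int(seq_len * keep_ratio), recent_window)

    # Aggregate attention mass per cached token across heads and query positions.
    token_scores = attn_weights.sum(dim=(0, 1))  # [seq_len]

    # Always retain the most recent tokens to preserve local context.
    token_scores[-recent_window:] = float("inf")

    # Keep the highest-scoring tokens; sort indices to preserve original order.
    keep_idx = torch.topk(token_scores, k=budget).indices.sort().values
    return keys[:, keep_idx], values[:, keep_idx], keep_idx


if __name__ == "__main__":
    heads, seq_len, head_dim, q_len = 8, 1024, 64, 1
    keys = torch.randn(heads, seq_len, head_dim)
    values = torch.randn(heads, seq_len, head_dim)
    attn = torch.softmax(torch.randn(heads, q_len, seq_len), dim=-1)
    k2, v2, kept = evict_kv_cache(keys, values, attn)
    print(k2.shape, v2.shape, kept.shape)  # roughly half of the original cache survives
```

Under these assumptions, the cache shrinks to about the stated 50% budget while the tokens that dominate the attention distribution (and the most recent context) are preserved; how CGE shields attention centers of gravity and how RGL encodes long-range positions are specified in the paper itself.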
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9602