Keywords: Large Language Models, Efficiency, Compression, Long Context
Abstract: As context lengths grow, the increasing size of the Key and Value (KV) cache poses a significant challenge to efficiently serving Large Language Models (LLMs). KV cache pruning, which retains only a small subset of important KV cache entries for sparse inference, is a recognized and effective solution. Our research reveals that large activations are key to identifying these important entries. However, existing methods fail to identify them effectively because they neglect the impact of the Value cache, and they are also incompatible with Grouped-Query Attention (GQA) architectures. To address these issues, we introduce a novel KV cache pruning method that preserves these large activations and is compatible with GQA. Built on a new pruning metric, the method operates within each attention group to improve efficiency and minimize performance degradation. Experimental results demonstrate that our approach maintains accuracy comparable to existing methods while significantly reducing KV cache requirements; specifically, it achieves similar accuracy while using only 1/10 of the KV cache required by existing SOTA methods.
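To make the group-level pruning idea concrete, the sketch below shows a generic KV cache pruning routine for a GQA model. It is not the paper's metric: the scoring rule (accumulated attention mass weighted by Value-vector norms), the tensor shapes, and the assumption that query heads are laid out group-major are all illustrative choices introduced here.

```python
# Illustrative sketch of group-level KV cache pruning under GQA.
# NOT the paper's exact metric: the score below simply combines accumulated
# attention weights with Value-vector norms, and all shapes are assumptions:
#   K, V : [num_kv_heads, seq_len, head_dim]
#   attn : [num_q_heads, seq_len]  (attention mass each query head puts on each cached token)
# Query heads are assumed to be ordered group-major, i.e. heads sharing a KV
# head are contiguous.
import torch


def prune_kv_cache(K, V, attn, group_size, keep_ratio=0.1):
    """Keep the top `keep_ratio` fraction of cached tokens per KV head (group)."""
    num_kv_heads, seq_len, head_dim = K.shape

    # Aggregate the attention scores of all query heads that share one KV head.
    attn_per_group = attn.view(num_kv_heads, group_size, seq_len).sum(dim=1)  # [kv_heads, seq_len]

    # Weight by Value-vector magnitude so tokens with large Value activations
    # are not discarded even when their raw attention weight is modest.
    value_norm = V.norm(dim=-1)           # [kv_heads, seq_len]
    score = attn_per_group * value_norm   # [kv_heads, seq_len]

    # Keep the same number of tokens in every group so tensors stay rectangular.
    k = max(1, int(seq_len * keep_ratio))
    keep_idx = score.topk(k, dim=-1).indices.sort(dim=-1).values  # [kv_heads, k]

    idx = keep_idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return K.gather(1, idx), V.gather(1, idx), keep_idx


if __name__ == "__main__":
    num_kv_heads, group_size, seq_len, head_dim = 4, 8, 1024, 128
    K = torch.randn(num_kv_heads, seq_len, head_dim)
    V = torch.randn(num_kv_heads, seq_len, head_dim)
    attn = torch.rand(num_kv_heads * group_size, seq_len)
    K_p, V_p, kept = prune_kv_cache(K, V, attn, group_size)
    print(K_p.shape, V_p.shape)  # both [4, 102, 128] at keep_ratio=0.1
```

Selecting one shared index set per KV group (rather than per query head) is what keeps the pruned cache compatible with GQA, since all query heads in a group read the same Key/Value entries.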
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 757