Preserving Large Activations: The Key to KV Cache Pruning

ICLR 2025 Conference Submission757 Authors

14 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Efficiency, Compression, Long Context
Abstract: As context lengths grow, the increasing size of the Key-Value (KV) cache poses a significant challenge to efficiently serving Large Language Models (LLMs). KV cache pruning, which preserves only a small subset of important KV cache entries for sparse inference, is a recognized and effective solution. Our research reveals that large activations are the key to identifying these important entries. However, existing methods fail to identify them effectively because they neglect the impact of the Value cache, and they are also incompatible with Grouped-Query Attention (GQA) architectures. To address these issues, we introduce a KV cache pruning method that preserves these large activations and is compatible with Grouped-Query Attention. Built around a novel pruning metric, the method operates within each attention group to improve efficiency and minimize performance degradation. Experimental results demonstrate that our approach not only matches the accuracy of existing methods but also significantly reduces KV cache requirements: it achieves comparable accuracy while using only 1/10 of the KV cache required by existing state-of-the-art (SOTA) methods.
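To make the abstract's idea of group-wise KV cache pruning concrete, below is a minimal sketch of the general technique. It is not the authors' method or metric: the importance score here (accumulated attention mass scaled by the L2 norm of the Value vector, so entries with large Value activations are kept) and the function name prune_kv_cache are illustrative assumptions. The only aspect taken from the abstract is that pruning is performed independently within each attention group, i.e., per KV head under GQA.

```python
# Minimal sketch of group-wise KV cache pruning (hypothetical scoring, not the
# submission's metric). Assumes PyTorch and a GQA layout where each KV head
# serves one group of query heads.
import torch


def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.1):
    """
    keys, values:  [num_kv_heads, seq_len, head_dim]
    attn_weights:  [num_kv_heads, seq_len] -- attention mass each cached token
                   received, already summed over the query heads in its group.
    Returns pruned keys/values of shape [num_kv_heads, keep, head_dim].
    """
    num_kv_heads, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))

    # Hypothetical importance score: attention mass scaled by Value magnitude,
    # so tokens carrying large Value activations are preserved.
    value_norm = values.norm(dim=-1)            # [num_kv_heads, seq_len]
    scores = attn_weights * value_norm          # [num_kv_heads, seq_len]

    # Select the top-scoring cache entries independently for each KV head
    # (i.e., within each GQA attention group).
    top_idx = scores.topk(keep, dim=-1).indices            # [num_kv_heads, keep]
    idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)   # gather index
    pruned_keys = keys.gather(1, idx)
    pruned_values = values.gather(1, idx)
    return pruned_keys, pruned_values


if __name__ == "__main__":
    k = torch.randn(8, 4096, 128)   # 8 KV heads, 4096 cached tokens
    v = torch.randn(8, 4096, 128)
    w = torch.rand(8, 4096)         # accumulated attention mass per token
    pk, pv = prune_kv_cache(k, v, w, keep_ratio=0.1)
    print(pk.shape, pv.shape)       # torch.Size([8, 409, 128]) for both
```

Scoring per KV head rather than per query head is what makes this style of pruning compatible with GQA: every query head in a group shares the same pruned cache, so no entries need to be duplicated across heads.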
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 757