Keywords: Large Language Models; KV Cache Compression; KV Cache Pruning
TL;DR: We propose a novel query-dependent KV cache channel pruning method to reduce the memory usage of LLMs during inference.
Abstract: Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
However, their increased computational and memory demands present significant challenges, especially when handling long sequences.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
Unlike existing approaches that reduce memory along the sequence-length dimension, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by the uneven magnitude distribution and low-rank structure of the attention weights.
In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also achieves a reduction in KV cache memory costs by over 20% compared with vanilla KV cache eviction and quantization methods. For instance, ThinK integrated with KIVI can achieve 2.8x peak memory reduction while maintaining nearly the same quality, enabling a batch size increase from 4x (with KIVI alone) to 5x when using a single GPU. Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK, establishing a new baseline algorithm for efficient LLM deployment without compromising performance.
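To make the idea concrete, below is a minimal PyTorch sketch of query-dependent channel pruning for the key cache as described in the abstract. The scoring criterion (channel-wise magnitude of the query-key interaction), the keep ratio, and the function names are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import torch

def prune_key_channels(K, Q, keep_ratio=0.6):
    """Illustrative query-dependent channel pruning of the key cache.

    K: cached keys of shape (batch, heads, seq_len, head_dim)
    Q: recent queries of shape (batch, heads, q_len, head_dim) used to score channels
    Returns the pruned keys and the indices of kept channels per head.
    """
    head_dim = K.shape[-1]
    k = max(1, int(keep_ratio * head_dim))
    # Assumed scoring rule: a channel matters if both queries and keys have
    # large magnitude along it, approximating its contribution to Q K^T.
    score = (Q.abs().sum(dim=2) * K.abs().sum(dim=2))          # (batch, heads, head_dim)
    keep_idx = score.topk(k, dim=-1).indices                   # (batch, heads, k)
    # Gather only the selected channels from the key cache.
    idx = keep_idx.unsqueeze(2).expand(-1, -1, K.shape[2], -1) # (batch, heads, seq_len, k)
    K_pruned = torch.gather(K, dim=-1, index=idx)
    return K_pruned, keep_idx

def attention_with_pruned_keys(q, K_pruned, keep_idx, V, scale):
    """Attention scores computed on the reduced channel set; values are untouched."""
    # Select the same channels from the incoming query before the dot product.
    idx = keep_idx.unsqueeze(2).expand(-1, -1, q.shape[2], -1)
    q_sel = torch.gather(q, dim=-1, index=idx)
    attn = torch.softmax((q_sel @ K_pruned.transpose(-2, -1)) * scale, dim=-1)
    return attn @ V
```

In this sketch only the key cache is pruned and the incoming query is gathered onto the same channel subset at decode time, so the memory saving comes from storing fewer key channels per head; how value channels and the softmax scale are handled is left to the paper.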
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1419