ClusterGen: Token Generation in Sublinear Time and Memory with Clustering KV Cache

Amir Zandieh; Insu Han; Vahab Mirrokni; Amin Karbasi

28 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: KV cache, large language models, clustering

Abstract: Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the need to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, we focus on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this insight, we devise a novel caching method with sublinear complexity that employs online clustering on key tokens and online sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed ClusterGen. This algorithm ensures a sublinear memory footprint and sublinear time complexity, and we also establish a tight error bound for our approach. Empirical evaluations on long-context question-answering tasks demonstrate that ClusterGen significantly outperforms existing state-of-the-art KV cache compression methods in terms of performance and efficiency.
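
To make the idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' ClusterGen algorithm. It assumes a simple greedy online clustering rule (a new key joins the first cluster whose center lies within a fixed `radius`, otherwise it opens a new cluster), and it replaces the paper's online sampling of values with exact per-cluster value sums to keep the example short. The class name `ClusteredKVCache`, the `radius` parameter, and the helper `exact_attention` are all hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_attention(query, keys, values):
    """Reference softmax attention over the full, uncompressed KV cache."""
    scores = [dot(query, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


class ClusteredKVCache:
    """Toy cache that keeps one representative key per cluster (assumed scheme)."""

    def __init__(self, radius):
        self.radius = radius
        self.centers = []     # first key seen in each cluster
        self.counts = []      # number of keys assigned to each cluster
        self.value_sums = []  # running sum of the values in each cluster

    def append(self, key, value):
        """Stream in one (key, value) pair, updating the clusters online."""
        for i, center in enumerate(self.centers):
            if math.dist(key, center) <= self.radius:
                self.counts[i] += 1
                self.value_sums[i] = [s + v for s, v in zip(self.value_sums[i], value)]
                return
        # No nearby center: open a new cluster represented by this key.
        self.centers.append(list(key))
        self.counts.append(1)
        self.value_sums.append(list(value))

    def attend(self, query):
        """Approximate attention, scoring every key by its cluster center."""
        scores = [dot(query, c) for c in self.centers]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        # Normalizer: each center stands in for `count` keys.
        z = sum(w * n for w, n in zip(weights, self.counts))
        dim = len(self.value_sums[0])
        return [sum(w * vs[j] for w, vs in zip(weights, self.value_sums)) / z
                for j in range(dim)]


keys = [[1.0, 0.0], [1.01, 0.0], [0.0, 1.0], [0.0, 1.01]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

cache = ClusteredKVCache(radius=0.1)
for k, v in zip(keys, values):
    cache.append(k, v)

approx = cache.attend([1.0, 0.0])
exact = exact_attention([1.0, 0.0], keys, values)
```

In this toy setting the cache stores one entry per cluster rather than one per token, so its size grows with the number of clusters instead of the sequence length. When keys are tightly clustered, `approx` stays close to `exact`; the gap widens as `radius` grows relative to the spread of the keys.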

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 13479
