SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

Published: 10 Jul 2024, Last Modified: 29 Aug 2024, COLM, CC BY 4.0
Research Area: Compute efficient LMs
Keywords: quantization, KVCache compression, model compression
TL;DR: This paper proposes SKVQ, a sliding-window KV cache quantization strategy that achieves high compression ratios while maintaining accuracy, enabling efficient handling of long context lengths in large language models.
Abstract: Large language models (LLMs) have demonstrated the capability to process extended token sequences, enabling complex tasks such as book comprehension and long-form text generation. However, as context length increases, the key-value (KV) cache required for LLMs consumes substantial memory, becoming a bottleneck for deployment. This paper introduces SKVQ (Sliding-window KV cache Quantization), a strategy designed to address the challenge of extremely low bitwidth KV cache quantization. SKVQ rearranges the channels of the KV cache to enhance channel similarity within quantization groups and applies clipped dynamic quantization at the group level. Furthermore, SKVQ maintains high precision for the most recent window tokens in the KV cache, preserving accuracy for a small yet critical portion of the cache. Our evaluation of LLMs demonstrates that SKVQ achieves high compression ratios while maintaining accuracy, outperforming previous quantization methods. SKVQ enables the quantization of the KV cache to 2-bit keys and 1.5-bit values with minimal accuracy loss. This advancement allows processing context lengths of up to 1M tokens on an 80GB GPU for a 7B parameter model, resulting in up to 7 times faster decoding.
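The abstract describes two ingredients: clipped dynamic quantization applied per group, and a most-recent window of tokens kept at full precision. A minimal NumPy sketch of that combination is below; the function names, the `clip` ratio, and all parameter values are illustrative assumptions, and the channel-reordering step of SKVQ is omitted for brevity:

```python
import numpy as np

def clipped_group_quant(x, bits=2, clip=0.9):
    # Clipped dynamic quantization of one group: the scale comes from a
    # clipped range rather than the raw min/max (clip ratio is a
    # hypothetical knob, not the paper's exact scheme).
    lo, hi = x.min() * clip, x.max() * clip
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo  # return dequantized values for comparison

def sliding_window_kv_quant(cache, window=4, group_size=8, bits=2):
    # cache: (tokens, channels). Tokens older than the most recent
    # `window` are quantized group-by-group along the channel dim; the
    # window itself stays full precision (the sliding-window idea, in
    # simplified form).
    out = cache.copy()
    n_old = max(0, cache.shape[0] - window)
    for t in range(n_old):
        for g in range(0, cache.shape[1], group_size):
            out[t, g:g + group_size] = clipped_group_quant(
                cache[t, g:g + group_size], bits=bits)
    return out

np.random.seed(0)
kv = np.random.randn(16, 32).astype(np.float32)
mixed = sliding_window_kv_quant(kv, window=4)
# The recent-window tokens are bit-exact; older tokens carry quantization error.
assert np.array_equal(mixed[-4:], kv[-4:])
```

In a real deployment the quantized groups would be stored as packed low-bit integers plus per-group scale/offset, rather than dequantized back to float as done here for illustration.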
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 860