More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

ACL ARR 2025 February Submission 7133 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: As large language models (LLMs) process increasingly long context windows, the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension and leave the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache at lower precision, a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. An in-depth analysis of the token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements on retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning remains stable and effective across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. We plan to release our code soon.
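The abstract's core idea, trading token count against storage precision under a fixed KV-cache memory budget, can be illustrated with a minimal sketch. This is not the authors' implementation: the shapes, the importance scores, and the helper names (`quantize_dequantize`, `prune_then_quantize`) are assumptions made purely for demonstration, and the comparison keeps the memory budget equal (256 tokens at 16-bit vs. 512 tokens at 8-bit) to mimic the "more tokens, lower precision" setting.

```python
# Illustrative sketch only (not the paper's code): under a fixed memory budget,
# compare keeping fewer tokens at high precision against keeping more tokens
# quantized to fewer bits. All names and shapes here are assumptions.
import numpy as np

def quantize_dequantize(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Simulate uniform per-token (per-row) quantization by round-tripping values."""
    levels = 2 ** n_bits - 1
    x_min = x.min(axis=-1, keepdims=True)
    scale = (x.max(axis=-1, keepdims=True) - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.round((x - x_min) / scale)
    return q * scale + x_min

def prune_then_quantize(kv: np.ndarray, scores: np.ndarray,
                        n_keep: int, n_bits: int) -> np.ndarray:
    """Keep the n_keep highest-scoring tokens and store them at n_bits precision."""
    keep = np.argsort(scores)[-n_keep:]               # indices of retained tokens
    compressed = np.zeros_like(kv)                    # evicted tokens contribute nothing
    compressed[keep] = quantize_dequantize(kv[keep], n_bits)
    return compressed

rng = np.random.default_rng(0)
seq_len, head_dim = 1024, 128
kv = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
scores = rng.random(seq_len)                          # stand-in for attention-based importance

# Equal memory budget: 256 tokens * 16 bits == 512 tokens * 8 bits.
few_tokens_high_prec = prune_then_quantize(kv, scores, n_keep=256, n_bits=16)
more_tokens_low_prec = prune_then_quantize(kv, scores, n_keep=512, n_bits=8)

for name, approx in [("256 tokens @ 16-bit", few_tokens_high_prec),
                     ("512 tokens @ 8-bit", more_tokens_low_prec)]:
    err = np.linalg.norm(kv - approx) / np.linalg.norm(kv)
    print(f"{name}: relative cache reconstruction error = {err:.3f}")
```

The sketch only measures reconstruction error of the cache itself with random importance scores; the paper's claims concern downstream long-context task performance with real pruning and quantization methods.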
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning; quantization
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 7133