Keywords: Large Language Model, KV cache optimization, Vector Quantization, Efficient LLM
TL;DR: We propose VQKV, a training-free KV cache compression method based on RSimVQ. It achieves over 80% compression with minimal performance loss and surpasses existing approaches at the same compression ratio.
Abstract: The increasing context length in Large Language Models (LLMs) leads to proportional growth of the Key-Value (KV) cache, posing a significant challenge for deployment in resource-limited settings. While existing training-free methods for KV cache compression, such as token eviction, feature dimension reduction, and scalar quantization, can reduce memory usage, they often do so at the cost of diminished model performance, especially at high compression ratios. To resolve this trade-off between memory efficiency and model fidelity, we introduce VQKV, a novel, training-free KV cache compression method based on vector quantization. Instead of discarding tokens or compressing individual dimensions, VQKV maps entire high-dimensional cache vectors to a compact, learned codebook, so that thousands of floating-point values can be represented by just a few integer indices into the codebook. As a result, VQKV achieves a high compression ratio with minimal performance degradation while enabling high-fidelity reconstruction of the original cache vectors through a simple codebook lookup. Extensive evaluations on LLaMA3.1-8B and LLaMA3.2-3B across long-context benchmarks demonstrate that VQKV significantly outperforms existing state-of-the-art compression methods at similar compression ratios, highlighting its effectiveness in preserving information while substantially reducing the memory footprint of the KV cache.
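To make the codebook-lookup idea concrete, here is a minimal sketch of generic vector quantization applied to KV cache vectors. It is not the authors' RSimVQ implementation; the function names (`quantize_kv`, `dequantize_kv`), the nearest-neighbor assignment, and the random "codebook" are illustrative assumptions only.

```python
# Minimal sketch of KV-cache vector quantization (illustrative, not the paper's RSimVQ).
import torch

def quantize_kv(kv: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each KV vector to the index of its nearest codebook entry.

    kv:       (num_tokens, head_dim) float tensor of key or value vectors
    codebook: (codebook_size, head_dim) float tensor of code vectors
    returns:  (num_tokens,) int tensor of codebook indices
    """
    # Pairwise Euclidean distances between cache vectors and codebook entries.
    dists = torch.cdist(kv, codebook)      # (num_tokens, codebook_size)
    return dists.argmin(dim=-1)            # only these integer indices are stored

def dequantize_kv(indices: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate KV vectors with a simple codebook lookup."""
    return codebook[indices]               # (num_tokens, head_dim)

if __name__ == "__main__":
    torch.manual_seed(0)
    head_dim, codebook_size, num_tokens = 128, 256, 1024
    codebook = torch.randn(codebook_size, head_dim)  # assumed pre-learned codebook
    kv = torch.randn(num_tokens, head_dim)

    idx = quantize_kv(kv, codebook)        # 1024 small integer indices
    kv_hat = dequantize_kv(idx, codebook)  # approximate reconstruction of the cache

    # 1024 x 128 floats shrink to 1024 one-byte indices plus one shared codebook,
    # illustrating the kind of memory saving the abstract describes.
    print(idx.shape, kv_hat.shape)
```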
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24276