Keywords: Large Language Model, KV cache optimization, Vector Quantization, Efficient LLM
TL;DR: We propose VQKV, a training-free KV cache compression method based on RSimVQ. It achieves over 80% compression with minimal performance loss and surpasses existing approaches at the same compression ratio.
Abstract: The increasing context length in Large Language Models (LLMs) leads to proportional growth of the Key-Value (KV) cache, posing a significant challenge for deployment in resource-limited settings. While existing training-free methods for KV cache compression, such as token eviction, feature dimension reduction, and scalar quantization, can reduce memory usage, they often do so at the cost of diminished model performance, especially at high compression ratios. To resolve this trade-off between memory efficiency and model fidelity, we introduce VQKV, a novel, training-free KV cache compression method based on vector quantization. Instead of discarding tokens or compressing individual dimensions, VQKV maps entire high-dimensional cache vectors to a compact, learned codebook, so that thousands of floating-point values can be represented by just a few integer indices into the codebook. As a result, VQKV achieves a high compression ratio with minimal performance degradation while enabling high-fidelity reconstruction of the original cache vectors through a simple codebook lookup. Extensive evaluations on LLaMA3.1-8B and LLaMA3.2-3B across long-context benchmarks demonstrate that VQKV significantly outperforms existing state-of-the-art compression methods at similar compression ratios, highlighting its effectiveness in preserving information while substantially reducing the memory footprint of the KV cache.
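To make the codebook-lookup idea concrete, here is a minimal sketch of generic vector quantization applied to KV cache vectors. It is not the authors' RSimVQ implementation; the function names (`quantize_kv`, `dequantize_kv`), the nearest-neighbor assignment, and the random "codebook" are illustrative assumptions only.

```python
# Minimal sketch of KV-cache vector quantization (illustrative, not the paper's RSimVQ).
import torch

def quantize_kv(kv: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each KV vector to the index of its nearest codebook entry.

    kv:       (num_tokens, head_dim) float tensor of key or value vectors
    codebook: (codebook_size, head_dim) float tensor of code vectors
    returns:  (num_tokens,) int tensor of codebook indices
    """
    # Pairwise Euclidean distances between cache vectors and codebook entries.
    dists = torch.cdist(kv, codebook)      # (num_tokens, codebook_size)
    return dists.argmin(dim=-1)            # only these integer indices are stored

def dequantize_kv(indices: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate KV vectors with a simple codebook lookup."""
    return codebook[indices]               # (num_tokens, head_dim)

if __name__ == "__main__":
    torch.manual_seed(0)
    head_dim, codebook_size, num_tokens = 128, 256, 1024
    codebook = torch.randn(codebook_size, head_dim)  # assumed pre-learned codebook
    kv = torch.randn(num_tokens, head_dim)

    idx = quantize_kv(kv, codebook)        # 1024 small integer indices
    kv_hat = dequantize_kv(idx, codebook)  # approximate reconstruction of the cache

    # 1024 x 128 floats shrink to 1024 one-byte indices plus one shared codebook,
    # illustrating the kind of memory saving the abstract describes.
    print(idx.shape, kv_hat.shape)
```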
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24276