Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks.
However, their extensive memory requirements stemming from KV cache growth, especially during long-text understanding and generation, pose significant challenges for real-world deployment in resource-constrained environments.
Quantization, a promising approach that preserves historical information while reducing memory consumption, has garnered significant attention.
We present XQuant, a training-free and plug-and-play framework that pushes KV cache quantization to ultra-low equivalent bit-width.
XQuant introduces two key improvements over existing quantization methods: a data-free calibration approach with negligible computational overhead, and cross-layer KV cache compression, which together enable ultra-low equivalent bit-width.
Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant achieves a lower equivalent bit-width (< 1.4 bits) than the KIVI-2bit and AsymKV-1.5bit baselines across various large language models while attaining superior performance, establishing a better trade-off between model performance and compression ratio.
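For intuition, the sketch below shows per-group asymmetric low-bit quantization of a KV cache tensor, the basic operation that KV cache quantization methods such as KIVI build on. It is an illustrative assumption only, not XQuant's actual calibration or cross-layer compression scheme; the function names `quantize_kv` / `dequantize_kv`, the 2-bit setting, and the group size of 32 are hypothetical choices for the example.

```python
import torch

def quantize_kv(x: torch.Tensor, n_bits: int = 2, group_size: int = 32):
    """Asymmetric per-group quantization of a KV cache tensor.

    The tensor is split into groups of `group_size` elements along the
    last dimension; each group gets its own scale and zero-point, so
    outliers in one group do not degrade the others.
    """
    orig_shape = x.shape
    x = x.reshape(-1, group_size)
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    q_max = 2 ** n_bits - 1
    scale = (x_max - x_min).clamp(min=1e-6) / q_max
    zero_point = (-x_min / scale).round()
    q = (x / scale + zero_point).round().clamp(0, q_max)
    return q.reshape(orig_shape), scale, zero_point

def dequantize_kv(q: torch.Tensor, scale, zero_point, group_size: int = 32):
    """Reconstruct an approximate full-precision tensor for attention."""
    orig_shape = q.shape
    q = q.reshape(-1, group_size)
    x = (q - zero_point) * scale
    return x.reshape(orig_shape)

# Example: quantize cached keys of shape (batch, heads, seq_len, head_dim)
keys = torch.randn(1, 8, 128, 64)
q_keys, scale, zp = quantize_kv(keys)
approx_keys = dequantize_kv(q_keys, scale, zp)
```

In a real system the 2-bit codes would be packed into bytes and the scales/zero-points stored per group; the equivalent bit-width reported in the abstract also accounts for this metadata overhead.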
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6505