Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks.
However, their extensive memory requirements stemming from KV cache growth, especially during long-text understanding and generation, pose significant challenges for real-world deployment in resource-constrained environments.
Quantization, a promising approach that preserves historical information while reducing memory consumption, has garnered significant attention.
We present XQuant, a training-free and plug-and-play framework that pushes KV cache quantization to ultra-low equivalent bit-width.
XQuant introduces two key improvements over existing quantization methods: a data-free calibration approach with negligible computational overhead, and cross-layer KV cache compression, which together enable ultra-low equivalent bit-width.
Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant achieves a lower equivalent bit-width (< 1.4 bits) than the KIVI-2bit and AsymKV-1.5bit baselines across various large language models while attaining superior performance, establishing a better trade-off between model performance and compression ratio.
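For intuition, the sketch below shows per-group asymmetric low-bit quantization of a KV cache tensor, the basic operation that KV cache quantization methods such as KIVI build on. It is an illustrative assumption only, not XQuant's actual calibration or cross-layer compression scheme; the function names `quantize_kv` / `dequantize_kv`, the 2-bit setting, and the group size of 32 are hypothetical choices for the example.

```python
import torch

def quantize_kv(x: torch.Tensor, n_bits: int = 2, group_size: int = 32):
    """Asymmetric per-group quantization of a KV cache tensor.

    The tensor is split into groups of `group_size` elements along the
    last dimension; each group gets its own scale and zero-point, so
    outliers in one group do not degrade the others.
    """
    orig_shape = x.shape
    x = x.reshape(-1, group_size)
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    q_max = 2 ** n_bits - 1
    scale = (x_max - x_min).clamp(min=1e-6) / q_max
    zero_point = (-x_min / scale).round()
    q = (x / scale + zero_point).round().clamp(0, q_max)
    return q.reshape(orig_shape), scale, zero_point

def dequantize_kv(q: torch.Tensor, scale, zero_point, group_size: int = 32):
    """Reconstruct an approximate full-precision tensor for attention."""
    orig_shape = q.shape
    q = q.reshape(-1, group_size)
    x = (q - zero_point) * scale
    return x.reshape(orig_shape)

# Example: quantize cached keys of shape (batch, heads, seq_len, head_dim)
keys = torch.randn(1, 8, 128, 64)
q_keys, scale, zp = quantize_kv(keys)
approx_keys = dequantize_kv(q_keys, scale, zp)
```

In a real system the 2-bit codes would be packed into bytes and the scales/zero-points stored per group; the equivalent bit-width reported in the abstract also accounts for this metadata overhead.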
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6505