Low-Rank Decomposition Assisted Quantization and Inference Compensation for Quality Large Language Model Inference

Published: 2025, Last Modified: 14 Mar 2026 | IJCNN 2025 | CC BY-SA 4.0
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance on natural language processing tasks. However, these models are computationally intensive and require substantial hardware resources for deployment. Quantization has emerged as a popular technique for LLM deployment because it reduces memory requirements, but it degrades accuracy, particularly at low bit widths. To mitigate this accuracy loss, we introduce Low-Rank Compensation (LoRC), a novel compensation mechanism that aims to recover the performance drop caused by quantization. Additionally, we propose Low-Rank Quantization (LoRQ), which further reduces the quantization-induced loss by adaptively adjusting weights at the element-wise level to help LLMs accommodate quantized computations. LoRC compensates for accuracy loss during inference, whereas LoRQ integrates low-rank compensation directly into the quantization process; neither requires end-to-end fine-tuning of the LLM. Furthermore, we propose the Rank-α Addition Strategy (RαAS) to integrate LoRC into the inference framework, which improves inference accuracy without increasing inference latency. Experimental results show that our method outperforms the state-of-the-art OmniQuant by 1.89% on several common zero-shot datasets under the W4A4 setting of the widely used LLaMA. Through the joint design of algorithms and systems, our techniques can be easily integrated into the FlexGen inference framework without introducing additional inference latency, thereby maintaining high throughput while improving accuracy.
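The general idea of compensating quantization error with a low-rank term can be illustrated with a minimal sketch. This is not the authors' LoRC or LoRQ procedure, which the abstract does not specify; it assumes a generic recipe in which the residual between the full-precision weights and their 4-bit round-to-nearest quantization is approximated by a truncated SVD. The function names `quantize_symmetric` and `low_rank_compensation`, the matrix sizes, and the rank are all hypothetical choices made for illustration.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Per-row symmetric round-to-nearest quantization.

    Returns the dequantized weights (same shape and dtype family as w).
    Assumption: a plain uniform quantizer, not the paper's scheme.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def low_rank_compensation(w, w_q, rank):
    """Return factors (a, b) such that a @ b is the best rank-`rank`
    approximation (in Frobenius norm) of the residual w - w_q."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    b = vt[:rank, :]             # shape (rank, in_features)
    return a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))   # hypothetical weight matrix
x = rng.standard_normal((128, 8))    # hypothetical input activations

w_q = quantize_symmetric(w, bits=4)
a, b = low_rank_compensation(w, w_q, rank=8)

# Weight reconstruction error without and with the low-rank term.
err_plain = np.linalg.norm(w - w_q)
err_comp = np.linalg.norm(w - (w_q + a @ b))

# At inference the compensation can be applied as two thin matmuls
# added to the quantized output, without materializing a @ b.
y_comp = w_q @ x + a @ (b @ x)
```

In this sketch the extra cost is two thin matrix products per layer, and the storage overhead is `rank * (out_features + in_features)` values. How the paper's RαAS avoids added latency is not described in the abstract and is not modeled here.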