Keywords: LoRA, quantization, Large Language Models (LLMs), Alternating Least Squares (ALS)
TL;DR: This paper presents a novel approach that enhances quantization performance for Large Language Models (LLMs) by improving low-rank matrix modeling with activation values and Alternating Least Squares (ALS).
Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the demand for efficient methodologies that balance model performance with hardware constraints, particularly GPU memory limitations. Quantization has emerged as a prominent technique for model compression, with QLoRA demonstrating the potential of low-rank matrices for quantization error compensation by integrating LoRA-based efficient fine-tuning. However, even LoRA fine-tuning requires substantial resources for models with tens or hundreds of billions of parameters. In this work, we explore low-rank matrix compensation for quantization errors without global LoRA fine-tuning, employing Alternating Least Squares (ALS) to better model and solve the optimization problem.
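To make the ALS component concrete, the sketch below fits a rank-r product A·B to the quantization error E = W − W_q by alternating two least-squares solves. This is a minimal illustration, not the paper's exact algorithm; the rank, iteration count, and initialization are assumptions.

```python
# Minimal sketch: compensate the quantization error E = W - W_q with a
# rank-r product A @ B fitted by plain Alternating Least Squares.
# The rank r, number of iterations, and initialization scale are assumptions.
import numpy as np

def als_error_compensation(W, W_q, r=16, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    E = W - W_q                      # quantization error to be absorbed
    m, n = E.shape
    A = rng.standard_normal((m, r)) * 0.01
    B = rng.standard_normal((r, n)) * 0.01
    for _ in range(iters):
        # Fix A, solve least squares for B:  min_B ||E - A B||_F
        B, *_ = np.linalg.lstsq(A, E, rcond=None)
        # Fix B, solve least squares for A:  min_A ||E - A B||_F
        At, *_ = np.linalg.lstsq(B.T, E.T, rcond=None)
        A = At.T
    return A, B                      # W_q + A @ B approximates W
```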
We introduce a novel approach that refines low-rank matrix modeling by incorporating activation values and optimizing the low-rank factors directly through ALS, which is particularly effective under low-bit quantization. Furthermore, we revisit the quantization interval partitioning of Round-to-Nearest (RTN) methods by introducing scaling factors that turn the discontinuous truncation function into a continuous optimization problem, thereby improving quantization performance through more principled interval adjustment. Extensive experimental evaluations support our theoretical contributions.
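The following sketch extends the same ALS idea with activation weighting: the factors are fitted so that the residual error is small on calibration activations X, i.e. minimizing ||(W − W_q − AB)X||_F. The closed-form updates, damping term, and tensor shapes are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch (assumptions: shapes, rank r, damping eps) of an activation-weighted
# ALS update: fit A @ B to the error E = W - W_q so that (E - A @ B) @ X is
# small, where X holds calibration activations. Not the paper's exact method.
import numpy as np

def weighted_als_compensation(W, W_q, X, r=16, iters=20, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    E = W - W_q                              # (out, in) quantization error
    H = X @ X.T                              # (in, in) activation second moment
    H += eps * np.trace(H) / H.shape[0] * np.eye(H.shape[0])  # damping for stability
    m, n = E.shape
    A = rng.standard_normal((m, r)) * 0.01   # (out, r)
    B = rng.standard_normal((r, n)) * 0.01   # (r, in)
    for _ in range(iters):
        # Fix B: minimize ||(E - A B) X||_F over A  ->  A = E H B^T (B H B^T)^{-1}
        BH = B @ H
        A = np.linalg.solve(BH @ B.T, (E @ BH.T).T).T
        # Fix A: minimize ||(E - A B) X||_F over B; with H full rank this
        # reduces to the ordinary least-squares solution B = A^+ E
        B, *_ = np.linalg.lstsq(A, E, rcond=None)
    return A, B                              # W_q + A @ B compensates W on X
```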
Our research reveals how low-rank matrices can effectively capture the intrinsic information of large models, overcoming limitations of traditional SVD-based approaches. Comprehensive experiments on standard benchmarks consistently show that our method outperforms state-of-the-art quantization techniques, providing a principled, data-driven framework for understanding the role of low-rank structure in quantization error compensation. This advance represents a significant step toward practical LLM deployment, offering more efficient and effective model compression strategies.
Primary Area: optimization
Submission Number: 22650