LoRDQ: Activation-Aware Low-Rank Decomposition and Quantization for Large Language Model Compression
Keywords: Model compression, post-training quantization, low-rank decomposition, weight-only quantization
Abstract: Large language models (LLMs) deliver high performance but remain prohibitively expensive to deploy in resource-constrained environments. Post-training quantization (PTQ) is widely used to reduce memory and compute, but its accuracy often degrades sharply in the ultra-low-bit regime. Although recent PTQ methods incorporate weight sensitivity for further improvement, the sensitivity analysis is typically conducted at the element-, row-, or vector-wise level within the original weight matrix, which can limit robustness at very low bitwidths. We instead operate at the \emph{subspace} level by deriving an activation-aware low-rank factorization of each weight matrix (for a given layer/block). The key idea is to represent each weight matrix by a small set of activation-aware components that retain most of the output energy, and to quantize only these factors, enabling higher precision per stored parameter under the same budget and improving accuracy in the low-bit regime. We thus propose \textbf{LoRDQ}, an activation-aware low-rank decomposition and quantization scheme that provides a closed-form factorization minimizing the layer-output reconstruction error and incorporates two complementary techniques to mitigate the loss from quantizing the low-rank factors: a block-wise greedy decomposition and an intra-block compensation step. Experiments demonstrate that LoRDQ achieves \(\sim\!10\times\) lower perplexity than existing methods such as GPTQ and AWQ. Moreover, leveraging our analytical results, we provide a \emph{theoretical explanation} for these gains by connecting them to the spectrum of the output Gram matrix \(WXX^\top W^\top\), clarifying when low-rank structure preserves critical model behavior.
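To make the core step concrete, the following is a minimal sketch (not the authors' implementation) of an activation-aware rank-\(k\) factorization that minimizes the layer-output reconstruction error \(\|WX - \hat{W}X\|_F\), followed by simple uniform quantization of the two factors. It uses the standard closed form obtained by applying Eckart-Young to \(WS\), where \(S\) is a square root of the input Gram matrix \(XX^\top\); the paper's block-wise greedy decomposition and intra-block compensation are not specified in the abstract and are omitted here. All function names and the quantizer are illustrative assumptions.

```python
# Hedged sketch of activation-aware low-rank factorization + factor quantization.
# Assumption: the closed-form step is the Eckart-Young optimum in the metric
# induced by the calibration activations; LoRDQ's block-wise greedy decomposition
# and intra-block compensation are NOT reproduced here.
import numpy as np


def activation_aware_lowrank(W, X, rank, eps=1e-6):
    """Return factors (A, B) with W_hat = A @ B minimizing ||W X - A B X||_F.

    W: (d_out, d_in) weight matrix of one linear layer.
    X: (d_in, n) calibration activations (layer inputs).
    Since ||M X||_F = ||M S||_F for any square root S of G = X X^T, the optimum
    is the rank-k truncated SVD of W S, mapped back through S^{-1}.
    """
    G = X @ X.T + eps * np.eye(X.shape[0])        # regularized input Gram matrix
    S = np.linalg.cholesky(G)                     # G = S S^T
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    Uk, sk, Vtk = U[:, :rank], s[:rank], Vt[:rank]
    A = Uk * sk                                   # (d_out, rank)
    B = np.linalg.solve(S.T, Vtk.T).T             # Vtk @ S^{-1}, shape (rank, d_in)
    return A, B


def quantize_uniform(M, n_bits=4):
    """Symmetric per-row uniform quantization of a factor matrix (illustrative only)."""
    scale = np.abs(M).max(axis=1, keepdims=True) / (2 ** (n_bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(M / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale                              # dequantized factor


# Toy usage: quantize only the low-rank factors and measure output reconstruction error.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
X = rng.standard_normal((512, 1024))
A, B = activation_aware_lowrank(W, X, rank=64)
W_hat = quantize_uniform(A) @ quantize_uniform(B)
err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
print(f"relative output error: {err:.4f}")
```

In this sketch, the squared singular values of \(WS\) are the eigenvalues of \(WXX^\top W^\top\), so the output energy retained at rank \(k\) is governed precisely by the spectrum of the output Gram matrix that the abstract's theoretical analysis appeals to.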
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18946