LoRDQ: Activation-Aware Low-Rank Decomposition and Quantization for Large Language Model Compression
Keywords: Model compression, post-training quantization, low-rank decomposition, weight-only quantization
Abstract: Large language models (LLMs) deliver high performance but remain prohibitively expensive to deploy in resource-constrained environments. Post-training quantization (PTQ) is widely used to reduce memory and compute, but its accuracy often degrades sharply in the ultra-low-bit regime. Although recent PTQ methods incorporate weight sensitivity for further improvement, the sensitivity analysis is typically conducted at the element-, row-, or vector-wise level within the original weight matrix, which can limit robustness at very low bitwidths. We instead operate at the \emph{subspace} level by deriving an activation-aware low-rank factorization of each weight matrix (for a given layer/block). The key idea is to represent each weight matrix by a small set of activation-aware components that retain most of the output energy, and to quantize only these factors, enabling higher precision per stored parameter under the same budget and improving accuracy in the low-bit regime. We thus propose \textbf{LoRDQ}, an activation-aware low-rank decomposition and quantization scheme that provides a closed-form factorization minimizing the layer-output reconstruction error and incorporates two complementary techniques to mitigate the loss from quantizing the low-rank factors: a block-wise greedy decomposition and an intra-block compensation step. Experiments demonstrate that LoRDQ achieves \(\sim\!10\times\) lower perplexity than existing methods such as GPTQ and AWQ. Moreover, leveraging our analytical results, we provide a \emph{theoretical explanation} for these gains by connecting them to the spectrum of the output Gram matrix \(WXX^\top W^\top\), clarifying when low-rank structure preserves critical model behavior.
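To make the core step concrete, the following is a minimal sketch (not the authors' implementation) of an activation-aware rank-\(k\) factorization that minimizes the layer-output reconstruction error \(\|WX - \hat{W}X\|_F\), followed by simple uniform quantization of the two factors. It uses the standard closed form obtained by applying Eckart-Young to \(WS\), where \(S\) is a square root of the input Gram matrix \(XX^\top\); the paper's block-wise greedy decomposition and intra-block compensation are not specified in the abstract and are omitted here. All function names and the quantizer are illustrative assumptions.

```python
# Hedged sketch of activation-aware low-rank factorization + factor quantization.
# Assumption: the closed-form step is the Eckart-Young optimum in the metric
# induced by the calibration activations; LoRDQ's block-wise greedy decomposition
# and intra-block compensation are NOT reproduced here.
import numpy as np


def activation_aware_lowrank(W, X, rank, eps=1e-6):
    """Return factors (A, B) with W_hat = A @ B minimizing ||W X - A B X||_F.

    W: (d_out, d_in) weight matrix of one linear layer.
    X: (d_in, n) calibration activations (layer inputs).
    Since ||M X||_F = ||M S||_F for any square root S of G = X X^T, the optimum
    is the rank-k truncated SVD of W S, mapped back through S^{-1}.
    """
    G = X @ X.T + eps * np.eye(X.shape[0])        # regularized input Gram matrix
    S = np.linalg.cholesky(G)                     # G = S S^T
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    Uk, sk, Vtk = U[:, :rank], s[:rank], Vt[:rank]
    A = Uk * sk                                   # (d_out, rank)
    B = np.linalg.solve(S.T, Vtk.T).T             # Vtk @ S^{-1}, shape (rank, d_in)
    return A, B


def quantize_uniform(M, n_bits=4):
    """Symmetric per-row uniform quantization of a factor matrix (illustrative only)."""
    scale = np.abs(M).max(axis=1, keepdims=True) / (2 ** (n_bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(M / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale                              # dequantized factor


# Toy usage: quantize only the low-rank factors and measure output reconstruction error.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
X = rng.standard_normal((512, 1024))
A, B = activation_aware_lowrank(W, X, rank=64)
W_hat = quantize_uniform(A) @ quantize_uniform(B)
err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
print(f"relative output error: {err:.4f}")
```

In this sketch, the squared singular values of \(WS\) are the eigenvalues of \(WXX^\top W^\top\), so the output energy retained at rank \(k\) is governed precisely by the spectrum of the output Gram matrix that the abstract's theoretical analysis appeals to.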
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18946