Keywords: Efficient model architectures
TL;DR: We propose SRR, a quantization method that preserves dominant subspaces before quantizing residuals. It naturally generalizes to activation-aware settings and improves PTQ and QPEFT performance under low-bit constraints.
Abstract: Post-training quantization (PTQ) enables efficient deployment of LLMs by converting weights to low-bit formats, but often degrades accuracy.
Quantization error reconstruction (QER) mitigates this by adding a low-rank correction term.
However, existing QER methods typically quantize weights before identifying low-rank structure, discarding information they later attempt to recover.
We propose Structured Residual Reconstruction (SRR), a simple yet effective reformulation of QER that first preserves dominant spectral directions and quantizes only the residual tail.
The final approximation combines the preserved low-rank structure with a quantized residual, yielding improved fidelity under the same rank constraint.
SRR generalizes to activation-aware settings by selecting dominant components according to their contributions in both the original weight space and the activation-weighted space.
We also apply SRR to quantized parameter-efficient fine-tuning (QPEFT) by freezing the preserved subspace and updating only the residual component during fine-tuning, which stabilizes training and leads to better adaptation.
Across both PTQ and QPEFT, SRR consistently improves performance under fixed rank constraints, providing an effective framework for quantization-aware compression.
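The abstract describes the decomposition only at a high level; the following is a minimal PyTorch sketch of the core idea, assuming a plain (not activation-weighted) truncated SVD for the preserved subspace and a simple per-tensor symmetric round-to-nearest quantizer for the residual. The function name `srr_quantize` and its `rank`/`n_bits` arguments are illustrative choices, not the authors' implementation.

```python
import torch

def srr_quantize(W: torch.Tensor, rank: int = 32, n_bits: int = 4):
    """Sketch of Structured Residual Reconstruction (SRR).

    1. Preserve the top-`rank` spectral directions of W exactly.
    2. Quantize only the residual tail.
    Returns the preserved low-rank component L and the quantized residual R_q,
    so that W is approximated as L + R_q.
    """
    # Step 1: keep the dominant subspace before any quantization.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

    # Step 2: quantize only the residual tail
    # (illustrative symmetric round-to-nearest quantizer).
    R = W - L
    qmax = 2 ** (n_bits - 1) - 1
    scale = R.abs().max() / qmax
    R_q = torch.clamp(torch.round(R / scale), -qmax - 1, qmax) * scale

    # Final approximation: preserved low-rank structure + quantized residual.
    return L, R_q
```

In the QPEFT setting described above, the factors of the preserved component would be kept frozen while only the residual part is updated during fine-tuning; the activation-aware variant would instead select the preserved directions using contributions measured in both the original and activation-weighted spaces.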
Submission Number: 45