ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Published: 05 Mar 2025, Last Modified: 21 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: quantization, weight quantization, activation quantization, kv cache quantization, llm quantization, efficient inference
TL;DR: We quantize the weights, activations, and KV cache of large language models to 4-bit while keeping only 1/8 of the channels in 8-bit.
Abstract: Quantizing weights, activations, and KV cache in large language models to 4-bit without degrading generalizability is challenging due to outlier-induced activation quantization errors. We propose ResQ, a post-training quantization (PTQ) method that uses principal component analysis to identify a low-rank subspace (in practice 1/8 of the hidden dimension) and keeps coefficients within this subspace in 8-bit while quantizing the rest in 4-bit. Within each subspace, an invariance-preserving random rotation is applied to further suppress outliers. ResQ outperforms recent PTQ methods on Llama and Qwen2.5, achieving up to 33% lower Wikitext perplexity than SpinQuant and up to a 3x speedup over 16-bit inference.
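The sketch below illustrates the mixed-precision scheme described in the abstract: fit a PCA basis on calibration activations, keep the coefficients in the top-variance subspace (rank = 1/8 of the hidden dimension) in 8-bit, quantize the residual coefficients to 4-bit, and apply random orthogonal rotations within each subspace to suppress outliers. It is a minimal NumPy illustration, not the authors' implementation; all function and variable names (e.g. `resq_like_quantize`, `symmetric_quantize`) are assumptions for exposition.

```python
# Minimal sketch of the ResQ idea: PCA low-rank subspace in 8-bit,
# residual subspace in 4-bit, random rotations within each subspace.
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform rotation

def symmetric_quantize(x, bits):
    """Per-tensor symmetric uniform quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def resq_like_quantize(x_calib, x, rank_fraction=1 / 8, seed=0):
    """Quantize activations x with a PCA basis fit on calibration data."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    r = max(1, int(d * rank_fraction))       # size of the 8-bit subspace

    # PCA basis from the calibration covariance, sorted by variance.
    cov = x_calib.T @ x_calib / len(x_calib)
    _, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1]

    u_hi, u_lo = basis[:, :r], basis[:, r:]  # principal / residual subspaces

    # Random rotations within each subspace (the spans are unchanged).
    u_hi = u_hi @ random_orthogonal(r, rng)
    u_lo = u_lo @ random_orthogonal(d - r, rng)

    # Project, quantize at mixed precision, and map back.
    c_hi = symmetric_quantize(x @ u_hi, bits=8)
    c_lo = symmetric_quantize(x @ u_lo, bits=4)
    return c_hi @ u_hi.T + c_lo @ u_lo.T

# Usage example with synthetic outlier-heavy activations.
rng = np.random.default_rng(1)
calib = rng.standard_normal((1024, 64))
calib[:, :4] *= 20.0                          # a few outlier channels
acts = rng.standard_normal((8, 64))
acts[:, :4] *= 20.0
x_hat = resq_like_quantize(calib, acts)
print("relative error:", np.linalg.norm(x_hat - acts) / np.linalg.norm(acts))
```

Because the high-variance (outlier-dominated) directions land in the 8-bit subspace, the 4-bit residual coefficients have a much smaller dynamic range, which is the intuition behind the mixed-precision split.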
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 54
