TL;DR: We perform weight/activation/KV cache quantization of large language models to 4-bit while keeping only 1/8 of the channels in 8-bit.
Abstract: Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state-of-the-art further. Using principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform- and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3× speedup over a 16-bit baseline. Anonymous code repository available at https://anonymous.4open.science/r/project-resq-2142.
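The following is a minimal NumPy sketch of the projection and mixed-precision idea described in the abstract, not the authors' implementation: PCA on calibration activations identifies the high-variance subspace (kept in 8-bit), the remaining channels are quantized to 4-bit, and a random orthogonal rotation is applied within each subspace to suppress outliers. Function names such as `build_resq_projection` and details like per-tensor symmetric quantization are illustrative assumptions.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform draw

def build_resq_projection(acts, hi_frac=1/8, seed=0):
    """acts: (num_tokens, hidden_dim) calibration activations."""
    rng = np.random.default_rng(seed)
    d = acts.shape[1]
    k = int(d * hi_frac)                    # size of high-precision subspace
    cov = np.cov(acts, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)        # eigenvalues ascending
    basis = eigvecs[:, ::-1]                # reorder to descending variance
    # Block-diagonal random rotations act inside each subspace, spreading
    # outliers across channels without mixing the two subspaces.
    rot = np.eye(d)
    rot[:k, :k] = random_orthogonal(k, rng)
    rot[k:, k:] = random_orthogonal(d - k, rng)
    return basis @ rot, k                   # orthogonal projection, split index

def quantize_sym(x, bits):
    """Symmetric quantize-dequantize with a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(x).max() / qmax, 1e-8)
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_quant(x, proj, k):
    """Project, keep the top-k subspace in 8-bit, quantize the rest to 4-bit."""
    z = x @ proj
    z[:, :k] = quantize_sym(z[:, :k], bits=8)
    z[:, k:] = quantize_sym(z[:, k:], bits=4)
    return z @ proj.T                       # rotate back (proj is orthogonal)
```

Usage would be to build the projection once from calibration data, fold it into adjacent weights where possible, and apply `mixed_precision_quant` to activations and KV cache at inference time.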
Lay Summary: ResQ: Making AI Models Smaller, Faster, and Smarter
Large language models (LLMs) like ChatGPT and Google Bard are incredibly powerful, but they are also massive and resource-hungry. Running these models quickly and efficiently is a big challenge, especially on limited hardware like laptops or smartphones. That is where ResQ comes in.

ResQ is a new method that makes these giant models much smaller and faster without sacrificing their brainpower. It does this with a smart trick: it figures out which parts of the model need more attention and keeps those in high detail, while safely shrinking the less important parts. By combining mathematical techniques (like PCA and random rotations) with an efficient hardware implementation, ResQ can shrink models to just 4-bit precision, four times smaller than the usual 16 bits, while keeping performance strong. This results in up to 5× faster processing and much lower memory use, all while keeping AI models smart enough for tasks like language understanding, reasoning, and even multi-modal tasks (like combining text with images).

ResQ's approach is also robust: it works across different model families like LLaMA and Qwen, and can even run huge models on a single graphics card instead of requiring multiple high-end servers. This could help make AI more accessible, running efficiently on everyday devices. In summary, ResQ makes AI models smaller, faster, and more efficient, paving the way for smarter AI everywhere.
Link To Code: https://github.com/utkarsh-dmx/project-resq
Primary Area: Deep Learning->Large Language Models
Keywords: quantization, large language models, low precision computation, kv cache compression, efficient language models
Submission Number: 11902