QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: We show that LLMs can be stably trained down to 1-bit, and are optimally trained at 4-bit weights and activations via a new quantized gradient estimation technique.
Abstract: One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by *directly training* over such representations, i.e., *Quantization-Aware Training (QAT)*, is still open: for example, a recent study put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state of the art via a new method called QuEST, for which we demonstrate optimality at 4-bit and stable convergence as low as 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new *trust gradient estimator* based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at [https://github.com/IST-DASLab/QuEST](https://github.com/IST-DASLab/QuEST).
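To make the two ingredients above more concrete, the following is a minimal PyTorch sketch of the general idea, under several simplifying assumptions: a symmetric uniform grid, a brute-force search for the MSE-optimal clipping range, and a simple error-threshold "trust" mask in the backward pass. The names `hadamard_matrix`, `mse_optimal_clip`, and `TrustQuantize` are illustrative only and are not the interface of the released QuEST code; see the repository linked above for the actual implementation.

```python
import torch


def hadamard_matrix(n: int) -> torch.Tensor:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5


def uniform_quantize(x: torch.Tensor, clip: torch.Tensor, levels: int) -> torch.Tensor:
    """Round x onto a uniform grid of `levels` points on [-clip, clip] (dequantized output)."""
    step = 2 * clip / (levels - 1)
    q_int = torch.clamp(torch.round((x + clip) / step), 0, levels - 1)
    return q_int * step - clip


def mse_optimal_clip(x: torch.Tensor, levels: int, n_grid: int = 64) -> torch.Tensor:
    """Brute-force search for the clipping range minimizing ||x - Q(x)||^2."""
    absmax = x.abs().max()
    best_clip, best_err = absmax, float("inf")
    for frac in torch.linspace(0.2, 1.0, n_grid):
        clip = frac * absmax
        err = (x - uniform_quantize(x, clip, levels)).pow(2).sum().item()
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip


class TrustQuantize(torch.autograd.Function):
    """Forward: quantize with an MSE-optimal clip. Backward: propagate gradients only
    where the local quantization error is small, i.e. where the quantized path can be
    "trusted" to approximate the full-precision gradient (a crude stand-in for the
    paper's trust gradient estimator)."""

    @staticmethod
    def forward(ctx, x, bits: int = 4, trust: float = 0.5):
        levels = 2 ** bits
        clip = mse_optimal_clip(x, levels)
        q = uniform_quantize(x, clip, levels)
        step = 2 * clip / (levels - 1)
        ctx.save_for_backward(((q - x).abs() <= trust * step).to(x.dtype))
        return q

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask, None, None


if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(256, 256, requires_grad=True)
    # (1) Hadamard normalization: rotate so coordinates become approximately Gaussian.
    H = hadamard_matrix(w.shape[-1])
    w_rot = w @ H
    # (2) MSE-optimal quantization with a trust-masked backward pass; rotate back afterwards.
    w_q = TrustQuantize.apply(w_rot, 4, 0.5) @ H.t()
    w_q.sum().backward()
    print("quantization MSE:", (w_q - w).pow(2).mean().item())
    print("gradient norm after trust masking:", w.grad.norm().item())
```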
Lay Summary: Large Language Models (LLMs) are expensive to train and use. Low-precision data types offer a way to reduce these costs. We propose a novel method for training LLMs directly in these data types, which yields improved model accuracy. We demonstrate that our method achieves the best accuracy-to-cost tradeoff at around 4-bit precision and remains stable at even lower precisions.
Link To Code: https://github.com/IST-DASLab/QuEST
Primary Area: Deep Learning->Algorithms
Keywords: efficiency, LLMs, quantization, gradient estimation
Submission Number: 11564