QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activations

Published: 05 Mar 2025, Last Modified: 04 Apr 2025
Venue: SLLM
License: CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: efficiency, LLMs, quantization, sparsity, gradient estimation
TL;DR: Stable training of LLMs with 1-Bit weights and activations; optimal training with 2:4 INT4.
Abstract: One main approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still largely open. In this paper, we advance the state of the art in QAT via a new method called QuEST, which is Pareto-competitive with FP16, that is, it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations, and is compatible with weight sparsity. Experiments on Llama-type architectures show that QuEST induces new, stable scaling laws across the entire range of hardware-supported compressed representations. Moreover, we provide GPU kernel support showing that the models produced by QuEST can be efficiently executed on current hardware.
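To make the QAT setting described in the abstract concrete, the sketch below shows a generic quantization-aware training step in which both weights and activations are fake-quantized to 4 bits in the forward pass, with gradients passed through via a straight-through estimator (one common gradient-estimation choice). This is a minimal, hypothetical illustration of standard QAT, not the QuEST method or its specific estimator; the helper names quantize_ste and QuantLinear are assumptions introduced for this example.

```python
import torch

def quantize_ste(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor uniform quantization with a straight-through estimator.

    Hypothetical illustration of generic QAT (not QuEST itself): the forward
    pass uses quantized values, the backward pass treats rounding as identity.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for signed 4-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale
    x_q = (x / scale).round().clamp(-qmax, qmax) * scale
    # STE trick: returns x_q in the forward pass, identity gradient in backward.
    return x + (x_q - x).detach()


class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights and activations are fake-quantized during training."""

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        w_q = quantize_ste(self.weight, bits=4)  # 4-bit weights
        a_q = quantize_ste(inp, bits=4)          # 4-bit activations
        return torch.nn.functional.linear(a_q, w_q, self.bias)


# Usage: drop-in replacement for nn.Linear inside a transformer block.
layer = QuantLinear(512, 512)
x = torch.randn(8, 512, requires_grad=True)
loss = layer(x).pow(2).mean()
loss.backward()  # gradients flow to the full-precision weights via the STE
```

In this setup the optimizer still updates full-precision "master" weights, while all matrix multiplications see only low-bit values; improving the gradient estimator beyond the plain STE is exactly the kind of design space the paper targets.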
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 41