Keywords: Quantization-aware training, Reasoning models, Large language models
TL;DR: We provide a set of key insights on how to improve quantization-aware training for reasoning models.
Abstract: Reasoning models have excelled at complex tasks such as coding and mathematical competitions, yet their reasoning processes suffer from low inference efficiency. Quantization is a popular way to boost efficiency, but prior work shows that it causes large performance drops in these models. To address this, we comprehensively benchmark quantization-aware training (QAT) for reasoning models. Our key findings are: (1) knowledge distillation serves as a versatile objective for reasoning models trained with either supervised fine-tuning or reinforcement-learning algorithms; (2) post-training quantization (PTQ) provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) QAT with reinforcement learning is feasible and yields additional gains for the quantized model; and (4) aligning the domain of the QAT training data with the PTQ calibration data further improves performance. Building on these insights, we propose Reasoning-QAT, an optimized QAT workflow tailored to reasoning models. Empirical results show that Reasoning-QAT outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on the DeepSeek-R1-Distill-Qwen-1.5B model, Reasoning-QAT surpasses FlatQuant by 2.92\% under W4A4KV4 quantization and GPTQ by 4.74\% under W3G128 quantization.
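The abstract describes the workflow only at a high level; below is a minimal illustrative sketch (not the authors' implementation) of what one QAT step with a knowledge-distillation objective could look like in PyTorch. The names `fake_quantize` and `qat_kd_step`, the 4-bit symmetric quantizer, and the temperature value are assumptions for illustration, and the student is assumed to have been initialized from a PTQ checkpoint beforehand.

```python
# Illustrative sketch only: one QAT step distilling a full-precision teacher
# into a fake-quantized student that was initialized from a PTQ checkpoint.
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q, gradient flows through w

def qat_kd_step(student, teacher, batch, optimizer, temperature: float = 2.0):
    """One training step: KL-distill the frozen teacher's logits into the student."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits  # weights fake-quantized inside the model
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```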
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8253