Keywords: Quantization-aware training, Reasoning models, Large language models
TL;DR: We provide a set of key insights on how to improve quantization-aware training for reasoning models.
Abstract: Reasoning models have excelled at complex tasks such as coding and mathematical competitions, yet their reasoning processes suffer from low inference efficiency. Quantization is a popular way to boost efficiency, but prior work shows that it causes large performance drops in these models. To address this, we comprehensively benchmark quantization-aware training (QAT) for reasoning models. Our key findings are: (1) knowledge distillation serves as a versatile objective for reasoning models trained with either supervised fine-tuning or reinforcement-learning algorithms; (2) post-training quantization (PTQ) provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) QAT with reinforcement learning is feasible and yields additional gains for the quantized model; and (4) aligning the domain of the QAT training data with the PTQ calibration data further improves performance. Building on these insights, we propose Reasoning-QAT, an optimized QAT workflow tailored to reasoning models. Empirical results show that Reasoning-QAT outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on the DeepSeek-R1-Distill-Qwen-1.5B model, Reasoning-QAT surpasses FlatQuant by 2.92\% under W4A4KV4 quantization and GPTQ by 4.74\% under W3G128 quantization.
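The abstract describes the workflow only at a high level; below is a minimal illustrative sketch (not the authors' implementation) of what one QAT step with a knowledge-distillation objective could look like in PyTorch. The names `fake_quantize` and `qat_kd_step`, the 4-bit symmetric quantizer, and the temperature value are assumptions for illustration, and the student is assumed to have been initialized from a PTQ checkpoint beforehand.

```python
# Illustrative sketch only: one QAT step distilling a full-precision teacher
# into a fake-quantized student that was initialized from a PTQ checkpoint.
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q, gradient flows through w

def qat_kd_step(student, teacher, batch, optimizer, temperature: float = 2.0):
    """One training step: KL-distill the frozen teacher's logits into the student."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits  # weights fake-quantized inside the model
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```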
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8253