On quantizing the state of the Muon optimizer

Published: 03 Mar 2026 · Last Modified: 03 Mar 2026 · License: CC BY 4.0
Keywords: quantization, muon, optimization, deep learning, pretraining
TL;DR: We show that the Muon optimizer is inherently robust to simple 8-bit linear quantization, allowing a 62% reduction in optimizer memory footprint with no loss in convergence or fine-tuning performance.
Abstract: The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency than AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer state remains a challenge for large-scale deployment. In this paper, we introduce an 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments, pre-training models of up to 2.7B parameters and fine-tuning them for instruction following, we demonstrate that 8-bit Muon matches full-precision Muon in validation loss and on downstream benchmarks, while reducing the optimizer-state footprint by up to 62%. Crucially, we show that Muon’s update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon’s robustness to quantization noise.
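To make the "blockwise" and "simple linear" terms concrete, the following is a minimal NumPy sketch of blockwise absmax 8-bit linear quantization as it might be applied to Muon's momentum state. This is an illustration of the general technique only, not the paper's implementation; the function names, block size, and padding scheme are our own assumptions.

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block_size: int = 256):
    """Quantize a tensor to int8 with one absmax scale per block.

    Each contiguous block of `block_size` values is divided by its own
    absolute maximum and rounded to the int8 range [-127, 127].
    """
    flat = x.astype(np.float32).ravel()
    pad = (-flat.size) % block_size          # zero-pad to a whole number of blocks
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                # avoid divide-by-zero on all-zero blocks
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, x.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    """Invert quantize_blockwise: rescale int8 codes back to float32."""
    flat = (q.astype(np.float32) / 127.0) * scales
    flat = flat.ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)
```

Because every block carries its own scale, the worst-case reconstruction error per element is bounded by that block's absmax divided by 254, which is the robustness property the abstract's analysis concerns. The storage cost is 1 byte per element plus one float32 scale per block (about 8.1 bits per value at block size 256).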
Submission Number: 74