On quantizing the state of the Muon optimizer

Aman Gupta; Shao Tang; Gregory Dexter; Abhishek Shivanna; Rafael Celente; Rohan Ramanath; Daniel Silva; Hiroto Udagawa; Daniel Thomas Braithwaite; Sathiya Keerthi

20 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: optimization, muon, llms, quantization

TL;DR: An 8-bit quantized version of the Muon optimizer saves memory and is competitive in training

Abstract: The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2× computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization, which improves stability over linear quantization for extreme values. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization, supporting both linear and dynamic schemes. We demonstrate that 8-bit Muon maintains stability under both schemes while delivering a ~74% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon with linear quantization outperforms AdamW and 8-bit AdamW when pretraining a 1.6B model on 4B FineWeb tokens, and achieves parity with Muon in Chinchilla-optimal training on 32B tokens, for both validation loss and downstream benchmark tasks. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. We also provide a theoretical perspective to help explain this robustness under quantization.
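
For readers unfamiliar with the term, the following is a minimal sketch of blockwise linear 8-bit quantization in general, not the authors' implementation. The function names, the block size of 4, and the symmetric absmax scaling are assumptions chosen for illustration; the abstract does not specify these details.

```python
from typing import List, Tuple


def quantize_blockwise_linear(
    values: List[float], block_size: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize to signed 8-bit codes with one linear scale per block.

    Assumption: symmetric absmax scaling, codes in [-127, 127].
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Avoid division by zero for an all-zero block.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(int(round(v / scale)) for v in block)
    return codes, scales


def dequantize_blockwise_linear(
    codes: List[int], scales: List[float], block_size: int = 4
) -> List[float]:
    """Map each 8-bit code back to a float using its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical optimizer-state values; the second block holds an outlier.
state = [0.5, -1.0, 0.25, 0.0, 100.0, -50.0, 1.0, 0.0]
codes, scales = quantize_blockwise_linear(state)
restored = dequantize_blockwise_linear(codes, scales)
```

Because each block carries its own scale, the outlier in the second block does not coarsen the resolution of the first block. That locality is the usual motivation for quantizing optimizer state blockwise rather than with a single tensor-wide scale.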

Primary Area: optimization

Submission Number: 22234
