Keywords: Hyperparameter transfer, model scaling, feature learning, optimizer design, distributed optimization, Muon
TL;DR: We show that Muon with a spectral norm constraint and blockwise orthogonalization enables parallel training while outperforming the original Muon optimizer. Furthermore, it exhibits learning rate transfer across model depth, width, and token count.
Abstract: Muon is a recent optimizer that relies on matrix orthogonalization of updates and has been shown to improve large language model (LLM) training. It does so by adding momentum and a Newton-Schulz orthogonalization iteration to the stochastic spectral descent (SSD) method. However, it incurs higher communication cost when tensor parallelism is enabled, and its hyperparameter transfer properties are not yet fully explored.
We first introduce block-wise orthogonalization, which splits weight matrices into independent tiles that are orthogonalized separately and then recombined, and we empirically analyze its influence on training. This preserves validation loss while allowing up to $16\times$ tensor-parallel splits of weight matrices.
Second, we show that under spectral regularization a single learning rate transfers when model depth, model width, and token count are co-scaled following Chinchilla guidelines.
Finally, we show that a higher weight decay value of $0.1$ underperforms during the first 80\% of training but outperforms lower values afterwards, which we attribute to the tighter spectral norm constraint. Based on this, we propose weight decay clipping and scheduling to capture both regimes.
Overall, we demonstrate experimentally on nanoGPT models from 124M to 1.4B parameters that spectral regularization, with both block-wise and full-matrix orthogonalization, enables learning rate transfer across multiple scaling dimensions and better generalization with weight decay due to the tighter spectral norm constraint.
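To make the block-wise orthogonalization idea concrete, below is a minimal PyTorch sketch: the update matrix is split into independent row tiles, each tile is approximately orthogonalized with a Newton-Schulz iteration, and the tiles are concatenated back together. The quintic coefficients are those popularized by the public Muon reference implementation; the tiling scheme, tile count, and function names here are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to a semi-orthogonal matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon reference implementation
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def blockwise_orthogonalize(W: torch.Tensor, num_row_blocks: int) -> torch.Tensor:
    """Split W row-wise into independent tiles, orthogonalize each tile, and recombine.
    Because each tile is processed independently, it can stay on its tensor-parallel
    shard without gathering the full matrix (the communication saving motivating the method)."""
    tiles = torch.chunk(W, num_row_blocks, dim=0)
    return torch.cat([newton_schulz_orthogonalize(t) for t in tiles], dim=0)

# Example: a 1024x512 momentum/update matrix split into 4 tiles (e.g., one per shard).
update = torch.randn(1024, 512)
orthogonalized = blockwise_orthogonalize(update, num_row_blocks=4)
print(orthogonalized.shape)  # torch.Size([1024, 512])
```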
Student Paper: No
Submission Number: 16