Towards Understanding Orthogonalization in Muon

Published: 11 Jun 2025, Last Modified: 10 Jul 2025, ES-FoMo III, CC BY 4.0
Keywords: Hyperparameter transfer, model scaling, feature learning, optimizer design, distributed optimization, Muon
TL;DR: We show that Muon with a spectral norm constraint and block-wise orthogonalization enables tensor-parallel training while outperforming the original Muon optimizer. Furthermore, it exhibits learning-rate transfer across model depth, width, and token count.
Abstract: Muon is a recent optimizer that relies on matrix orthogonalization of updates and has been shown to improve large language model (LLM) training. However, it incurs higher communication cost when tensor parallelism is enabled, and its hyperparameter transfer properties are not yet fully explored. We first introduce block-wise orthogonalization, splitting weight matrices into independent tiles that are orthogonalized separately and recombined, and we empirically analyze its influence on training. This preserves validation loss while allowing up to $16\times$ tensor-parallel splits of weight matrices. Second, we show that under spectral regularization a single learning rate transfers when model depth, model width, and token count are co-scaled following Chinchilla guidelines. Finally, we show that a higher weight decay of $0.1$ underperforms during the first 80\% of training but outperforms lower values afterwards, which we attribute to the tighter spectral norm constraint. The code is available at https://anonymous.4open.science/r/MuonSBW-23A2.
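The abstract only sketches block-wise orthogonalization at a high level; the snippet below is a minimal illustrative sketch, not the authors' implementation. It assumes a 1-D column split of the update matrix, a plain cubic Newton–Schulz iteration (the reference Muon optimizer uses a tuned quintic variant), and hypothetical function names and tile counts.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix.

    Uses the cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X
    (illustrative choice; Muon's reference code uses a quintic variant).
    """
    X = G / (G.norm() + 1e-7)          # scale so singular values lie in (0, 1]
    transpose = X.size(0) > X.size(1)  # iterate in the cheaper "wide" orientation
    if transpose:
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transpose else X

def blockwise_orthogonalize(G: torch.Tensor, num_blocks: int = 4) -> torch.Tensor:
    """Sketch of block-wise orthogonalization: split the update into independent
    column tiles, orthogonalize each tile separately, and concatenate the results.

    Because each tile is handled independently, tiles can live on different
    tensor-parallel ranks without extra communication for this step.
    The 1-D column tiling is an assumed scheme for illustration.
    """
    tiles = G.chunk(num_blocks, dim=1)
    return torch.cat([newton_schulz_orthogonalize(t) for t in tiles], dim=1)

# Example: orthogonalize a 1024x4096 gradient update split into 4 tiles
update = torch.randn(1024, 4096)
orth_update = blockwise_orthogonalize(update, num_blocks=4)
```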
Submission Number: 49