Keywords: Hyperparameter transfer, model scaling, feature learning, optimizer design, distributed optimization, Muon
TL;DR: We show that Muon with a spectral norm constraint and blockwise orthogonalization enables parallel training while outperforming the original Muon optimizer. Furthermore, it exhibits learning rate transfer across model depth, width, and token count.
Abstract: Muon is a recent optimizer that relies on matrix orthogonalization of updates and has been shown to improve large language model (LLM) training. It does so by adding momentum and a Newton-Schulz orthogonalization iteration to the stochastic spectral descent (SSD) method. However, it incurs higher communication cost when tensor parallelism is enabled, and its hyperparameter transfer properties are not yet fully explored.
We first introduce block-wise orthogonalization, which splits weight matrices into independent tiles that are orthogonalized separately and then recombined, and we empirically analyze its influence on training. This preserves validation loss while allowing up to $16\times$ tensor-parallel splits of weight matrices.
Second, we show that under spectral regularization a single learning rate transfers when model depth, model width, and token count are co-scaled following Chinchilla guidelines.
Finally, we show that a higher weight decay value of $0.1$ underperforms during the first 80\% of training but outperforms lower values afterwards, which we attribute to the tighter spectral norm constraint. Based on this, we propose weight decay clipping and scheduling to capture both regimes.
Overall, we demonstrate experimentally on nanoGPT models from 124M to 1.4B parameters that spectral regularization, with both block-wise and full-matrix orthogonalization, enables learning rate transfer across multiple scaling dimensions and better generalization with weight decay due to the tighter spectral norm constraint.
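To make the block-wise orthogonalization idea concrete, below is a minimal PyTorch sketch: the update matrix is split into independent row tiles, each tile is approximately orthogonalized with a Newton-Schulz iteration, and the tiles are concatenated back together. The quintic coefficients are those popularized by the public Muon reference implementation; the tiling scheme, tile count, and function names here are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to a semi-orthogonal matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon reference implementation
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def blockwise_orthogonalize(W: torch.Tensor, num_row_blocks: int) -> torch.Tensor:
    """Split W row-wise into independent tiles, orthogonalize each tile, and recombine.
    Because each tile is processed independently, it can stay on its tensor-parallel
    shard without gathering the full matrix (the communication saving motivating the method)."""
    tiles = torch.chunk(W, num_row_blocks, dim=0)
    return torch.cat([newton_schulz_orthogonalize(t) for t in tiles], dim=0)

# Example: a 1024x512 momentum/update matrix split into 4 tiles (e.g., one per shard).
update = torch.randn(1024, 512)
orthogonalized = blockwise_orthogonalize(update, num_row_blocks=4)
print(orthogonalized.shape)  # torch.Size([1024, 512])
```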
Student Paper: No
Submission Number: 16