Convergence Bound and Critical Batch Size of Muon Optimizer

06 Mar 2026 (modified: 23 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents a theoretical analysis that supports its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without Nesterov momentum and weight decay. We then show that adding weight decay ensures almost-sure boundedness of the parameter and gradient norms---without relying on the commonly imposed bounded-gradient assumption---and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive a lower bound on the critical batch size for Muon---the batch size that minimizes the stochastic first-order oracle (SFO) complexity of training. Because the resulting formula involves problem-dependent quantities that are not directly observable (gradient variance, target precision, effective rank), it does not predict the critical batch size in absolute terms; rather, it reveals how the hyperparameters $\beta$ (momentum) and $\lambda$ (weight decay) govern the qualitative scaling of this value. Our experiments validate these hyperparameter-dependent predictions on workloads including image classification and language modeling.
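For context on the terminology in the abstract, the sketch below (a schematic illustration, not taken from the submission) shows how a threshold condition of the form mentioned in the revision notes, $X/T + Y/b + Z < \epsilon$, determines a critical batch size once the SFO complexity is defined as the batch size times the number of required iterations. Here $X$, $Y$, and $Z$ are placeholder constants, and the paper's actual lower bound may take a different form.

```latex
% Schematic only (placeholder constants, not the paper's exact statements).
% T(b): iterations needed to reach precision \epsilon with batch size b;
% N(b) = b * T(b): SFO complexity, i.e., total number of stochastic gradient evaluations.
% If a convergence bound yields the threshold condition  X/T + Y/b + Z < \epsilon,  then
\[
  T(b) \;=\; \frac{X}{\epsilon - Z - Y/b}
  \quad\text{(valid for } b > Y/(\epsilon - Z)\text{)},
  \qquad
  N(b) \;=\; b\,T(b) \;=\; \frac{X b^{2}}{(\epsilon - Z)\,b - Y}.
\]
% Setting dN/db = 0 gives the minimizer of this schematic complexity,
% i.e., the critical batch size under the assumed bound form:
\[
  b^{\star} \;=\; \frac{2Y}{\epsilon - Z}.
\]
```

Under this schematic form, the critical batch size grows with the batch-size-dependent constant $Y$ and shrinks as the accuracy budget $\epsilon - Z$ grows; the hyperparameters $\beta$ and $\lambda$ would enter only through the constants $X$, $Y$, and $Z$.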
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=31oMHlGSmV
Changes Since Last Submission: To address the concerns regarding the persuasiveness of our theoretical framework, we have significantly refined the convergence analysis and the derivation of the critical batch size. Specifically, by employing the matrix Hölder inequality $\vert\langle A, B \rangle_{\rm F}\vert \leq \Vert A\Vert_* \Vert B\Vert_{\rm op}$, we improved the convergence rate from $O(1/T + \sqrt{(1-\beta)r/b} + n)$ to $O(1/T + \sqrt{(1-\beta)r/b} + n\eta)$. Unlike the previous $O(n)$ term, which implied a fundamental limit to convergence, the new $O(n\eta)$ term shows that the residual error is coupled with the step size, so practitioners can suppress the final optimization error to an arbitrary level by choosing a sufficiently small learning rate, marking a significant improvement over the fixed error floor identified in the prior version. Furthermore, we updated the derivation of the critical batch size and revised the threshold condition from $X/T + Y/b < \epsilon$ to $X/T + Y/b + Z < \epsilon$, which yields a more robust and accurate formula for the critical batch size and addresses the reviewers' concerns. Please note that the critical batch size formula itself has also been updated accordingly to reflect these improvements in the convergence theorem. Together, these changes eliminate the weaknesses pointed out in the previous review and strengthen the overall consistency and persuasiveness of our results.

We also corrected several minor typographical errors and improved wording in a few sentences; no technical content or results were changed.
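As a quick sanity check of the inequality invoked above, the following standalone snippet (not part of the submission) numerically verifies the trace/Hölder duality $\vert\langle A, B \rangle_{\rm F}\vert \leq \Vert A\Vert_* \Vert B\Vert_{\rm op}$ on random matrices, using NumPy's built-in nuclear and spectral norms.

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    # Random rectangular matrices of a common, randomly chosen shape.
    m, n = rng.integers(2, 8, size=2)
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((m, n))

    # Frobenius inner product <A, B>_F = trace(A^T B).
    frob_inner = float(np.sum(A * B))

    # Nuclear norm ||A||_* (sum of singular values) and operator norm ||B||_op (largest singular value).
    nuclear_A = np.linalg.norm(A, ord="nuc")
    operator_B = np.linalg.norm(B, ord=2)

    # Matrix Hoelder inequality: |<A, B>_F| <= ||A||_* ||B||_op (small tolerance for floating point).
    assert abs(frob_inner) <= nuclear_A * operator_B + 1e-9

print("The inequality held for all 1000 random pairs.")
```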
Assigned Action Editor: ~Fanhua_Shang2
Submission Number: 7803