Beyond the Ideal: Analyzing the Inexact Muon Update

Published: 03 Feb 2026 · Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: We provide the first theoretical analysis of the practical, inexact Muon update, studying how approximation error affects its convergence and the optimal choice of learning rate and momentum.
Abstract: The Muon optimizer has rapidly emerged as a powerful, geometry-aware alternative to AdamW, demonstrating state-of-the-art performance in large-scale training of deep neural networks. A critical disconnect, however, exists between its theory and practice: Muon's efficiency relies on fast, approximate orthogonalization, yet all prior theoretical work analyzes an idealized, computationally intractable version that assumes exact updates. This work moves beyond the ideal by providing the first analysis of the *inexact* orthogonalized update at Muon's core. We develop our analysis within the general framework of Linear Minimization Oracle (LMO)-based optimization, introducing a realistic additive error model that captures the inexactness of practical approximation schemes. Our analysis yields explicit bounds that quantify performance degradation as a function of the LMO inexactness $\delta$. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum, showing that the training strategy must adapt to the oracle's precision. These findings elevate the approximation procedure (e.g., the number of Newton-Schulz steps) from an implementation detail to a critical parameter that must be *co-tuned* with the learning schedule. Our theoretical insights are validated with experiments on vision and language models.
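To make the role of the inexactness concrete, here is a minimal NumPy sketch of a Muon-style update in which the orthogonalization is computed by a small number of Newton-Schulz iterations. The function names, hyperparameter values, and the classic cubic iteration used here are illustrative assumptions, not the paper's exact algorithm; fewer iterations correspond to a larger oracle error $\delta$, which the paper argues should be co-tuned with the learning rate and momentum.

```python
import numpy as np

def newton_schulz_orthogonalize(G, num_steps=5):
    """Approximate the orthogonal factor U V^T of G with a few Newton-Schulz steps.

    Fewer steps give a cheaper but less exact orthogonalization; this gap is the
    LMO inexactness delta studied in the paper. The cubic iteration below is a
    classic variant used here only for illustration.
    """
    # Normalize so all singular values lie in (0, 1], inside the convergence
    # region of the cubic Newton-Schulz iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(num_steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def inexact_muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95, ns_steps=5):
    """One inexact Muon-style update on a weight matrix W (hypothetical sketch)."""
    momentum_buf = beta * momentum_buf + grad               # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf, ns_steps)
    return W - lr * update, momentum_buf

# Usage: ns_steps is tuned jointly with lr and beta rather than fixed in advance.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
grad = rng.standard_normal((64, 32))
buf = np.zeros_like(W)
W, buf = inexact_muon_step(W, grad, buf, lr=0.02, beta=0.95, ns_steps=3)
```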
Submission Number: 2350