High-dimensional isotropic scaling dynamics of Muon and SGD

Published: 22 Sept 2025 · Last Modified: 01 Dec 2025 · NeurIPS 2025 Workshop · CC BY 4.0
Keywords: muon, sgd, stochastic optimization, deep learning theory, high-dimensional probability, random matrix theory
TL;DR: We investigate the isotropic scaling dynamics of Muon vs. SGD in a matrix-valued linear regression setting.
Abstract: Recent developments in neural network optimization have brought renewed interest to non-diagonal preconditioning methods. Muon is a promising algorithm that uses approximate orthogonalization of matrix-valued updates to efficiently traverse poorly conditioned loss landscapes. However, the theoretical underpinnings of Muon's performance, particularly in high-dimensional regimes, remain underexplored. This paper investigates the isotropic scaling dynamics of Muon compared to SGD in a matrix-valued linear regression setting. We derive risk recursion equations for both optimizers under isotropic data assumptions, and identify how batch size must scale with dimension for efficient training. Our work also suggests that in the high-dimensional limit, Muon's default normalization may not be sufficient to maintain its nonlinear properties.
Submission Number: 132
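To make the setting concrete, the following is a minimal sketch (not the authors' code) contrasting an SGD step with a Muon-style step on matrix-valued linear regression with isotropic data. The Newton-Schulz quintic coefficients and the Frobenius normalization follow Muon's widely circulated reference implementation; the dimensions, learning rate, step count, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately map G to U V^T from its SVD G = U S V^T via Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + eps)          # Muon's default Frobenius normalization
    flip = X.shape[0] > X.shape[1]
    if flip:                                   # iterate in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X

rng = np.random.default_rng(0)
d_in, d_out, batch, lr = 64, 32, 128, 0.05     # illustrative sizes, not the paper's
W_star = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # ground-truth weights

def stochastic_grad(W):
    """Minibatch gradient of 0.5 * ||X W - Y||_F^2 / batch with isotropic X."""
    X = rng.standard_normal((batch, d_in))
    Y = X @ W_star + 0.01 * rng.standard_normal((batch, d_out))
    return X.T @ (X @ W - Y) / batch

W_sgd = np.zeros((d_in, d_out))
W_muon = np.zeros((d_in, d_out))
for _ in range(500):
    W_sgd -= lr * stochastic_grad(W_sgd)                    # plain SGD step
    W_muon -= lr * newton_schulz(stochastic_grad(W_muon))   # Muon-style step

for name, W in (("SGD", W_sgd), ("Muon", W_muon)):
    print(f"{name}: parameter risk = {np.linalg.norm(W - W_star)**2 / d_out:.4f}")
```

One behavior worth noting in this toy run: because the orthogonalized update has near-unit spectral norm regardless of the gradient's magnitude, constant-step Muon plateaus at a risk floor set by the learning rate, which is one concrete way normalization choices can interact with scale.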