How Does Orthogonalization Adapt to the Neural-Network Hessian Structure? A Gradient Self Outer-Product Analysis at Initialization

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Muon, Asymptotic Theory, Preconditioning, Row Normalization
TL;DR: We prove that the orthogonalization preconditioner asymptotically exhibits a row-block dominant structure at initialization.
Abstract: Muon orthogonalizes a weight matrix's momentum before each step, and on neural networks this simple preconditioner beats entry-wise optimizers by a wide margin. Most existing analyses, however, work in a very abstract problem class, from which it is hard to see why orthogonalization should be particularly suited to neural networks. This work analyzes Muon's preconditioner in three concrete neural-network settings. The layer-wise Hessian of a neural network is known to be diagonally dominant within its row blocks, while Muon's implicit preconditioner has the matching Kronecker form $(\mathbf{V}\mathbf{V}^\top)^{1/2}\otimes\mathbf{I}$. The two align exactly when $\mathbf{V}\mathbf{V}^\top$ is itself diagonal, which raises a concrete question: when is $\mathbf{V}\mathbf{V}^\top$ (approximately) diagonal? To answer it, we compute $E[\mathbf{G}\mathbf{G}^\top]$ (equivalently $E[\mathbf{V}\mathbf{V}^\top]$ at initialization) in closed form under Gaussian init, for three standard settings: symmetric matrix factorization, deep linear networks, and two-layer ReLU networks. In each case, the diagonal entries dominate the off-diagonal ones as the width grows. Hence $\mathbf{V}\mathbf{V}^\top$ is asymptotically diagonal: Muon's preconditioner aligns with the Hessian's row-block structure.
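The diagonal-dominance claim can be probed numerically. The following is a minimal Monte Carlo sketch for the symmetric matrix factorization setting only, not the paper's closed-form computation. Everything specific in it is an assumption made for illustration: the loss $\tfrac{1}{4}\|\mathbf{W}\mathbf{W}^\top - \mathbf{M}\|_F^2$ with $\mathbf{W}\in\mathbb{R}^{d\times k}$, unit-variance Gaussian entries, an all-ones target $\mathbf{M}$, the choice of $k$ as the "width", and the helper name `offdiag_ratio`. The paper's parametrization and scaling may differ.

```python
import numpy as np

def offdiag_ratio(d, k, target, n_samples=2000, seed=0):
    """Monte Carlo estimate of mean|off-diag| / mean|diag| of E[G G^T].

    Assumed setup (illustrative, not taken from the paper):
      loss(W) = 1/4 * ||W W^T - target||_F^2,  W in R^{d x k},
      W_ij ~ N(0, 1) i.i.d., so the gradient is G = (W W^T - target) W.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, d, k))        # batch of Gaussian inits
    residual = W @ np.swapaxes(W, 1, 2) - target      # (n_samples, d, d)
    G = residual @ W                                  # (n_samples, d, k)
    GGt = (G @ np.swapaxes(G, 1, 2)).mean(axis=0)     # sample mean of G G^T
    off_mask = ~np.eye(d, dtype=bool)
    return np.abs(GGt[off_mask]).mean() / np.abs(np.diag(GGt)).mean()

d = 4
target = np.ones((d, d))  # hypothetical target with nonzero off-diagonal entries
ratios = {k: offdiag_ratio(d, k, target) for k in (8, 64, 512)}
for k, r in ratios.items():
    print(f"width k={k:4d}  mean|off-diag| / mean|diag| = {r:.4f}")
```

Under these assumptions the printed ratio shrinks as $k$ grows, which is the qualitative behavior the abstract describes. The estimate carries sampling noise, so the values at large $k$ are only indicative of the trend rather than precise.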
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 211