How Does Orthogonalization Adapt to the Neural-Network Hessian Structure? A Gradient Self Outer-Product Analysis at Initialization
Keywords: Muon, Asymptotic Theory, Preconditioning, Row Normalization
TL;DR: We prove that the orthogonalization preconditioner asymptotically exhibits a row-block dominant structure at initialization.
Abstract: Muon orthogonalizes a weight matrix's momentum before each step,
and on neural networks this simple preconditioner beats entry-wise
optimizers by a wide margin. Most existing analyses, however, are set
in very abstract problem classes, which makes it hard to see why
orthogonalization should be particularly suited to neural networks.
This work analyzes Muon's preconditioner in three concrete
neural-network settings. The layer-wise Hessian of a neural network
is known to be diagonally dominant within its row blocks, while
Muon's implicit preconditioner has the matching Kronecker form
$(\mathbf{V}\mathbf{V}^\top)^{1/2}\otimes\mathbf{I}$, where
$\mathbf{V}$ denotes the momentum matrix. The two align
exactly when $\mathbf{V}\mathbf{V}^\top$ is itself diagonal, which
raises a concrete question: when is $\mathbf{V}\mathbf{V}^\top$
(approximately) diagonal? To answer it, we compute
$E[\mathbf{G}\mathbf{G}^\top]$ (equivalently
$E[\mathbf{V}\mathbf{V}^\top]$ at initialization) in closed form
under Gaussian init, for three standard settings: symmetric matrix
factorization, deep linear networks, and two-layer ReLU networks.
In each case, the diagonal entries dominate the off-diagonal ones
as the width grows. Hence $\mathbf{V}\mathbf{V}^\top$ is
asymptotically diagonal in expectation at initialization, so Muon's
preconditioner aligns with the Hessian's row-block structure.
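The two ingredients of the abstract can be illustrated numerically. The sketch below is not the paper's derivation; it rests on several assumptions made only for illustration: orthogonalization is computed with an exact SVD (practical Muon implementations approximate it with a Newton-Schulz iteration), matrices are vectorized row by row, and $E[\mathbf{G}\mathbf{G}^\top]$ is estimated by Monte Carlo for the first layer of a hypothetical two-layer linear network with squared loss and Gaussian initialization. All function names, sizes, and scalings are illustrative choices.

```python
import numpy as np

def orthogonalize(V):
    """Return U @ Wt from the thin SVD V = U diag(s) Wt (exact orthogonalization)."""
    U, _, Wt = np.linalg.svd(V, full_matrices=False)
    return U @ Wt

def inv_sqrt_psd(A):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(A)
    return (Q / np.sqrt(w)) @ Q.T  # Q diag(w^{-1/2}) Q^T

def first_layer_grad(W1, W2, X, Y):
    """Gradient of (1/2n) * ||W2 W1 X - Y||_F^2 with respect to W1."""
    R = W2 @ W1 @ X - Y
    return W2.T @ R @ X.T / X.shape[1]

def mean_gram(h, d=8, o=4, n=32, trials=2000, seed=0):
    """Monte Carlo estimate of E[G G^T] over Gaussian initializations.

    Assumed toy setup: hidden width h, input dim d, output dim o,
    a fixed random dataset (X, Y), and 1/sqrt(fan_in) weight scaling.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((o, n))
    acc = np.zeros((h, h))
    for _ in range(trials):
        W1 = rng.standard_normal((h, d)) / np.sqrt(d)
        W2 = rng.standard_normal((o, h)) / np.sqrt(h)
        G = first_layer_grad(W1, W2, X, Y)
        acc += G @ G.T
    return acc / trials

def offdiag_ratio(M):
    """Mean absolute off-diagonal entry divided by mean absolute diagonal entry."""
    k = M.shape[0]
    off = M - np.diag(np.diag(M))
    return (np.abs(off).sum() / (k * (k - 1))) / np.abs(np.diag(M)).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    V = rng.standard_normal((4, 6))  # full row rank with probability one
    P = inv_sqrt_psd(V @ V.T)
    # Orthogonalization equals left-preconditioning by (V V^T)^{-1/2} ...
    print(np.allclose(P @ V, orthogonalize(V)))
    # ... which in row-major vectorized form is the Kronecker factor P (x) I.
    print(np.allclose(np.kron(P, np.eye(6)) @ V.ravel(), orthogonalize(V).ravel()))
    for h in (4, 16, 64):
        print(h, offdiag_ratio(mean_gram(h)))
```

The first two checks confirm that orthogonalizing $\mathbf{V}$ is the same as applying the inverse of $(\mathbf{V}\mathbf{V}^\top)^{1/2}\otimes\mathbf{I}$ to the row-major vectorization of $\mathbf{V}$. In this particular toy network the off-diagonal entries of $E[\mathbf{G}\mathbf{G}^\top]$ vanish by a sign-flip symmetry of the hidden units, so the printed ratios reflect sampling noise only; the three settings analyzed in the paper may behave differently at finite width.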
Submission Number: 211