Spectral Equalization Minimizes Total Training Energy: A Control-Theoretic Account of Muon's Advantage

Euijin Hong

Published: 29 May 2026, Last Modified: 29 May 2026; Venue: HiLD at ICML 2026 (Poster); License: CC BY 4.0

Keywords: high-dimensional learning dynamics, optimizer dynamics, stochastic optimization, control theory, Muon optimizer, spectral equalization, total training energy, Kronecker Hessian, scaling laws

Abstract: We propose the total training energy $\mathcal E=\sum_t \|\delta_t\|^2$, the integral squared error of the back-propagated signal $\delta_t=\partial L_t/\partial y_t$, as a trajectory-level diagnostic for deep learning optimizers, viewing every first-order method as a discrete-time feedback controller closing the loop around the loss landscape. On quadratic losses with a Kronecker-factored Gauss-Newton Hessian $H=\Sigma_x\otimes\Sigma_\delta$, $\mathcal E$ equals the squared $\mathcal H_2$ norm of the closed loop and decomposes exactly into per-mode contributions $\mathcal E_{ij}=\lambda_i\mu_j a_{ij}(0)^2/(1-\rho_{ij}^2)$. Because $\rho\mapsto 1/(1-\rho^2)$ is strictly convex on $[0,1)$, Jensen's inequality implies that, among controllers with a common weighted-mean contraction rate, uniform per-mode rates strictly minimize $\mathcal E$ -- exactly what Muon's polar step $\operatorname{polar}(M_t)=UV^\top$ implements near the Kronecker eigenbasis. On MNIST with one-, two-, and three-layer MLPs trained with SGD, AdamW, and Muon, Muon attains the smallest Gini coefficient of the per-mode energy distribution $\{\mathcal E_{ij}\}$ on every monitored hidden matrix and the smallest cumulative energy on two of three, with the advantage concentrated in the low-curvature tail of the spectrum. The framework recasts "orthogonalizing momentum helps" as a measurable, mechanism-level claim about the geometry of the Hessian spectrum.
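The per-mode formula and the Jensen step in the abstract can be checked numerically. The sketch below is an illustration only, not the paper's code: it assumes each mode amplitude contracts geometrically, $a_{ij}(t+1)=\rho_{ij}\,a_{ij}(t)$, and it assumes the "weighted-mean contraction rate" uses the weights $w_{ij}=\lambda_i\mu_j a_{ij}(0)^2$, which is the weighting under which Jensen's inequality applies directly. The spectra, initial amplitudes, and function names are hypothetical.

```python
import numpy as np

def per_mode_energy(lam, mu, a0, rho):
    """Closed form E_ij = lam_i * mu_j * a_ij(0)^2 / (1 - rho_ij^2)."""
    weights = np.outer(lam, mu) * a0**2
    return weights / (1.0 - rho**2)

def simulated_energy(lam, mu, a0, rho, steps=5000):
    """Accumulate lam_i * mu_j * a_ij(t)^2 over time, assuming a(t+1) = rho * a(t)."""
    curvature = np.outer(lam, mu)
    a = a0.copy()
    total = np.zeros_like(a0)
    for _ in range(steps):
        total += curvature * a**2
        a = rho * a  # elementwise geometric contraction of each mode
    return total

# Hypothetical spectra and initial conditions (not taken from the paper).
rng = np.random.default_rng(0)
lam = np.array([4.0, 1.0, 0.25])   # assumed eigenvalues of Sigma_x
mu = np.array([2.0, 0.5])          # assumed eigenvalues of Sigma_delta
a0 = rng.normal(size=(3, 2))       # initial mode amplitudes a_ij(0)
rho = rng.uniform(0.1, 0.95, size=(3, 2))  # non-uniform per-mode rates

E_closed = per_mode_energy(lam, mu, a0, rho)
E_sim = simulated_energy(lam, mu, a0, rho)

# Uniform controller with the same weighted-mean rate (assumed weighting).
w = np.outer(lam, mu) * a0**2
rho_bar = (w * rho).sum() / w.sum()
E_uniform = per_mode_energy(lam, mu, a0, np.full_like(rho, rho_bar)).sum()
E_nonuniform = E_closed.sum()

print("closed form matches simulation:", np.allclose(E_closed, E_sim))
print("uniform energy < non-uniform energy:", E_uniform < E_nonuniform)
```

Under these assumptions the geometric sum reproduces the closed form, and replacing the non-uniform rates by their weighted mean lowers the total energy, as the convexity of $\rho\mapsto 1/(1-\rho^2)$ predicts.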

Submission Number: 97
