Keywords: Implicit Bias, Spectral Descent, Muon
TL;DR: Implicit optimization bias of p-norm normalized steepest descent (NSD) and normalized momentum steepest descent (NMD), including Spectral Descent and Muon as special cases
Abstract: Different gradient-based methods for optimizing overparameterized models can all achieve zero training error yet converge to distinct solutions with different generalization properties. We provide the first complete characterization of the implicit optimization bias of p-norm normalized steepest descent (NSD) and normalized momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin with respect to the classifier matrix's p-norm, and we establish the corresponding convergence rates. These results encompass important special cases, including Spectral Descent and Muon, which we show converge to max-margin solutions with respect to the spectral norm. A key insight of our analysis is that general entry-wise and Schatten p-norms can be reduced to the max-norm case of NSD/NMD by exploiting a natural ordering property between all p-norms relative to the max-norm and its dual sum-norm. For descent with respect to the max-norm, we further extend the analysis to include preconditioning, showing that Adam converges to the max-margin solution with respect to the matrix max-norm. Our results demonstrate that the multi-class linear setting, which is inherently richer than its binary counterpart, provides the most transparent framework for studying the implicit biases of matrix-parameter optimization algorithms.
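To make the update rules concrete, below is a minimal NumPy sketch (illustrative only, not taken from the submission) of a single NSD step under the two norms highlighted in the abstract: under the entry-wise max-norm the steepest-descent direction is the sign of the gradient (sign descent), while under the spectral norm it is U Vᵀ from the gradient's SVD, the direction used by Spectral Descent and, with momentum, by Muon.

```python
import numpy as np

def nsd_step_maxnorm(W, grad, lr):
    # NSD step w.r.t. the entry-wise max-norm: the unit-norm direction
    # maximizing correlation with the gradient is its entry-wise sign
    # (i.e., sign descent).
    return W - lr * np.sign(grad)

def nsd_step_spectral(W, grad, lr):
    # NSD step w.r.t. the spectral norm: the direction is U @ Vt from the
    # gradient's SVD (Spectral Descent; Muon additionally applies momentum
    # and approximates U @ Vt with Newton-Schulz iterations).
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)

# Toy usage on a placeholder gradient (illustrative assumption, not the paper's setup).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # classifier matrix
grad = rng.standard_normal((4, 3))   # stand-in for the cross-entropy gradient
W_max = nsd_step_maxnorm(W, grad, lr=0.1)
W_spec = nsd_step_spectral(W, grad, lr=0.1)
```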
Student Paper: Yes
Submission Number: 23