Keywords: Spectral Descent, Shampoo, Gradient Descent, Generalization, Imbalanced Data, Linear Models
TL;DR: We show that Spectral GD/Shampoo generalizes better than GD on imbalanced data.
Abstract: The growing adoption of spectrum-aware matrix-valued optimizers such as Shampoo and Muon in deep learning motivates a systematic study of their generalization properties and, in particular, of when they might outperform competing methods. We approach this challenging question through a sequence of simplifying abstractions: First, we use imbalanced data as a testbed for studying the behavior of spectrum-aware optimizers. Second, we study the canonical form of such optimizers, Spectral Gradient Descent (SpecGD)—each update step is $\mathbf{U}\mathbf{V}^T$, where $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ is the truncated SVD of the gradient. Third, within this framework we identify a minimal linear setting in which we can analyze when SpecGD outperforms vanilla GD. We show that unlike GD, which prioritizes learning majority classes first, SpecGD initially learns all principal components of the data at equal rates. We demonstrate how this translates into a growing gap in balanced accuracy favoring SpecGD early in training.
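To make the update rule from the abstract concrete, the following is a minimal sketch of one SpecGD step in Python/NumPy; the function name `specgd_update` and the parameters `lr` and `rank` are illustrative choices, not from the paper.

```python
import numpy as np

def specgd_update(W, grad, lr=0.1, rank=None):
    """One Spectral Gradient Descent step (illustrative sketch).

    The step direction is U V^T from the (optionally truncated) SVD of the
    gradient, so every singular direction is stepped with equal magnitude
    rather than being weighted by its singular value.
    """
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is not None:
        # Keep only the top-`rank` singular directions (truncated SVD).
        U, Vt = U[:, :rank], Vt[:rank, :]
    # Vanilla GD would instead return W - lr * grad.
    return W - lr * (U @ Vt)
```

In contrast to GD, whose per-direction step size scales with the gradient's singular values (and hence favors majority-class directions), this update moves equally along all retained principal directions.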
Student Paper: Yes
Submission Number: 78