Keywords: Spectral Descent, Shampoo, Gradient Descent, Generalization, Imbalanced Data, Linear Models
TL;DR: We show that Spectral GD/Shampoo generalizes better than GD on imbalanced data.
Abstract: The growing adoption of spectrum-aware matrix-valued optimizers such as Shampoo and Muon in deep learning motivates a systematic study of their generalization properties and, in particular, of when they might outperform competing methods. We approach this challenging question through a sequence of simplifying abstractions: First, we use imbalanced data as a testbed for studying the behavior of spectrum-aware optimizers. Second, we study the canonical form of such optimizers, Spectral Gradient Descent (SpecGD)—each update step is $\mathbf{U}\mathbf{V}^T$, where $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ is the truncated SVD of the gradient. Third, within this framework we identify a minimal linear setting in which we can analyze when SpecGD outperforms vanilla GD. We show that unlike GD, which prioritizes learning majority classes first, SpecGD initially learns all principal components of the data at equal rates. We demonstrate how this translates into a growing gap in balanced accuracy favoring SpecGD early in training.
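To make the update rule from the abstract concrete, the following is a minimal sketch of one SpecGD step in Python/NumPy; the function name `specgd_update` and the parameters `lr` and `rank` are illustrative choices, not from the paper.

```python
import numpy as np

def specgd_update(W, grad, lr=0.1, rank=None):
    """One Spectral Gradient Descent step (illustrative sketch).

    The step direction is U V^T from the (optionally truncated) SVD of the
    gradient, so every singular direction is stepped with equal magnitude
    rather than being weighted by its singular value.
    """
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is not None:
        # Keep only the top-`rank` singular directions (truncated SVD).
        U, Vt = U[:, :rank], Vt[:rank, :]
    # Vanilla GD would instead return W - lr * grad.
    return W - lr * (U @ Vt)
```

In contrast to GD, whose per-direction step size scales with the gradient's singular values (and hence favors majority-class directions), this update moves equally along all retained principal directions.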
Student Paper: Yes
Submission Number: 78