Keywords: optimization, whitening, shampoo, muon
Abstract: A range of recent optimizers has emerged that approximate the same "matrix-whitening" transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain their performance. With hyperparameters tuned across the board, all flavors of matrix-whitening methods reliably outperform their elementwise counterparts, such as Adam. Matrix-whitening is often related to spectral descent; however, our metrics reveal that performance gains are *not explained solely by accurate spectral normalization*: notably, SOAP displays the largest per-step gain, even though Muon more closely follows the steepest spectral-descent direction. Instead, we argue that matrix-whitening serves *two* purposes, and that the variance-adaptation component is the overlooked ingredient explaining this performance gap. Experiments show that variance-adapted versions of optimizers consistently outperform their sign-descent counterparts, including an adaptive version of Muon. We further ablate variance-adaptation strategies, finding that while "lookahead"-style approximations are less effective, low-rank variance estimators can reduce memory costs with no loss in performance.
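For intuition, here is a minimal NumPy sketch (our own illustration, not the paper's implementation) contrasting the elementwise variance adaptation used by Adam with the SVD-based matrix whitening that spectral-descent methods such as Muon target; all names and shapes are illustrative.

```python
# Minimal sketch, assuming standard formulations (not the paper's code):
# contrasts an elementwise, Adam-style variance-adapted step with a
# matrix-whitened ("spectral") step computed via SVD. For G = U S V^T,
# the whitened update (G G^T)^{-1/2} G = U V^T sets all singular values
# to 1, i.e. the steepest-descent direction under the spectral norm.
import numpy as np

def adam_like_step(grad, v, t, beta2=0.999, eps=1e-8):
    """Elementwise variance adaptation: scale each entry by its RMS."""
    v = beta2 * v + (1 - beta2) * grad**2
    v_hat = v / (1 - beta2**t)           # bias correction
    return grad / (np.sqrt(v_hat) + eps), v

def whitened_step(grad):
    """Matrix whitening via SVD: replace all singular values with 1."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))            # a weight-matrix-shaped gradient
v = np.zeros_like(G)
elem_step, v = adam_like_step(G, v, t=1)
spec_step = whitened_step(G)
print(np.linalg.svd(spec_step, compute_uv=False)[:3])   # all approximately 1.0
```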
Primary Area: optimization
Submission Number: 23003