Structured Preconditioners in Adaptive Optimization: A Unified Analysis

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners, covering both online regret minimization and offline convex optimization. Our analysis not only recovers matching rates for several important structured preconditioned algorithms, including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of the original Shampoo. Interestingly, more structured preconditioners (e.g., those of diagonal AdaGrad and AdaGrad-Norm, which use less space and compute) are often presented as computationally efficient approximations to full-matrix AdaGrad, with the implicit assumption that better approximations yield better optimization performance. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is substantially cheaper than full-matrix AdaGrad, can outperform it both theoretically and experimentally.
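To make the structures mentioned in the abstract concrete, the sketch below shows standard textbook forms of the preconditioner updates for a matrix-shaped parameter with gradient G, going from most structured (AdaGrad-Norm) to least structured (full-matrix AdaGrad), plus a one-sided Shampoo-style update. This is an illustrative sketch, not the paper's exact algorithms: the function names, learning rate, and epsilon are assumed for exposition, and the paper's variants may differ in details.

```python
import numpy as np

# Illustrative sketch of the preconditioner structures discussed in the abstract,
# for a parameter of shape (m, n) with gradient G. Hyperparameters are placeholders.

def adagrad_norm_step(G, state, lr=0.1, eps=1e-8):
    # Scalar (most structured) preconditioner: one accumulator for the whole tensor.
    state["s"] = state.get("s", 0.0) + np.sum(G * G)
    return -lr * G / (np.sqrt(state["s"]) + eps)

def diag_adagrad_step(G, state, lr=0.1, eps=1e-8):
    # Diagonal preconditioner: one accumulator per coordinate.
    state["h"] = state.get("h", np.zeros_like(G)) + G * G
    return -lr * G / (np.sqrt(state["h"]) + eps)

def full_matrix_adagrad_step(G, state, lr=0.1, eps=1e-8):
    # Full-matrix preconditioner over the flattened parameter (O(d^2) memory).
    g = G.reshape(-1)
    d = g.size
    state["H"] = state.get("H", np.zeros((d, d))) + np.outer(g, g)
    vals, vecs = np.linalg.eigh(state["H"])  # inverse square root via eigendecomposition
    inv_sqrt = vecs @ np.diag(1.0 / (np.sqrt(np.maximum(vals, 0.0)) + eps)) @ vecs.T
    return (-lr * inv_sqrt @ g).reshape(G.shape)

def one_sided_shampoo_step(G, state, lr=0.1, eps=1e-8):
    # One-sided Shampoo-style update: a single m x m left preconditioner,
    # much cheaper than full-matrix AdaGrad when m << m * n.
    m = G.shape[0]
    state["L"] = state.get("L", np.zeros((m, m))) + G @ G.T
    vals, vecs = np.linalg.eigh(state["L"])
    inv_sqrt = vecs @ np.diag(1.0 / (np.sqrt(np.maximum(vals, 0.0)) + eps)) @ vecs.T
    return -lr * inv_sqrt @ G
```

The progression illustrates the space/compute trade-off at issue: AdaGrad-Norm stores one scalar, diagonal AdaGrad stores one value per parameter, one-sided Shampoo stores an m x m factor, and full-matrix AdaGrad stores a d x d matrix over all d = m * n parameters.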
Lay Summary: When training machine learning models, especially large ones, we use tools called optimizers to help the model learn quickly and accurately. One popular optimizer is AdaGrad, which adjusts how fast each part of the model learns based on how difficult it is to train. A more powerful version called full-matrix AdaGrad learns even faster but is too expensive to run on today’s large models because it requires too much memory and computation. To address this, researchers have designed simplified versions that use less memory by making structured approximations. Common wisdom assumes that these simpler versions are merely cheaper, not more effective. In our work, we challenge that belief. We show that some of these cheaper, more structured optimizers, such as AdaGrad-Norm and a variant of Shampoo we call “one-sided Shampoo”, can actually perform better than the more complex ones in certain settings, both in theory and in practice. We also provide the first unified mathematical framework that explains why this happens. This research helps us better understand how to speed up the training of large models efficiently without sacrificing performance, an important step toward making powerful AI systems more accessible and sustainable.
Primary Area: Optimization
Keywords: Shampoo, adaptive optimization, layerwise
Submission Number: 13872