On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. NeurIPS 2025 poster. License: CC BY 4.0
Keywords: AdamW, Adam, convergence analysis, nonconvex optimization
TL;DR: This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW.
Abstract: As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not yet well understood theoretically. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we show that $E[\|\nabla f(x)\|_1]\geq\sqrt{\frac{2d}{\pi}}\,E[\|\nabla f(x)\|_2]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experiments on real-world deep learning tasks show $\|\nabla f(x)\|_1=\Theta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both results support viewing our convergence rate as analogous to the optimal convergence rate of SGD.
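A minimal numerical sketch (not from the paper) of the Gaussian-gradient relation stated in the abstract: if each coordinate of $\nabla f(x)$ is i.i.d. $\mathcal N(0,1)$, the ratio $E[\|\nabla f(x)\|_1]/E[\|\nabla f(x)\|_2]$ should be close to $\sqrt{2d/\pi}$. The dimensions and sample counts below are chosen only for illustration.

```python
# Monte-Carlo check of E[||g||_1] >= sqrt(2d/pi) * E[||g||_2] for g_i ~ N(0, 1) i.i.d.
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 1_000, 100_000):               # illustrative dimensions (assumption)
    g = rng.standard_normal((10_000, d))      # 10k sampled "gradient" vectors
    l1 = np.abs(g).sum(axis=1).mean()         # estimate of E[||g||_1]
    l2 = np.linalg.norm(g, axis=1).mean()     # estimate of E[||g||_2]
    print(f"d={d:>7}: E||g||_1 / E||g||_2 = {l1 / l2:.3f}, "
          f"sqrt(2d/pi) = {np.sqrt(2 * d / np.pi):.3f}")
```

Under these assumptions the printed ratio tracks $\sqrt{2d/\pi}$, i.e. the $\ell_1$ norm of a dense Gaussian gradient is larger than its $\ell_2$ norm by a factor of order $\sqrt{d}$, which is why the $\sqrt{d}$ factor in the $\ell_1$-norm rate is comparable to the dimension-free $\ell_2$-norm rate of SGD.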
Supplementary Material: zip
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 7230