Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

ICLR 2025 Conference Submission12989 Authors

28 Sept 2024 (modified: 13 Oct 2024)ICLR 2025 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Adaptive gradient methods, stochastic nonconvex optimization, dimensional dependence
TL;DR: We analyzed the convergence rate of AdaGrad measured by the gradient $\ell_1$-norm, and proved that it can shave a factor of $d$ from the complexity of SGD under favorable settings.
Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point measured by the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an **upper bound** on the convergence rate of AdaGrad and a corresponding **lower bound** for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12989
Loading