Adaptive Gradient Methods Can Be Provably Faster than SGD with Random Shuffling

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Abstract: Adaptive gradient methods have been shown to outperform SGD in many neural network training tasks. However, this acceleration has yet to be explained in the non-convex setting, since the best known convergence rate of adaptive gradient methods in the literature is worse than that of SGD. In this paper, we prove that adaptive gradient methods achieve an $\tilde{O}(T^{-1/2})$ convergence rate for finding first-order stationary points under the strong growth condition, which improves the previous best convergence results for adaptive gradient methods and random-shuffling SGD by factors of $O(T^{-1/4})$ and $O(T^{-1/6})$, respectively. In particular, we study two variants of AdaGrad with random shuffling for finite-sum minimization. Our analysis suggests that the combination of random shuffling and adaptive learning rates gives rise to better convergence.
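To make the setting concrete, below is a minimal sketch of AdaGrad-style updates combined with random shuffling for finite-sum minimization. This is an illustrative assumption of the general recipe the abstract describes, not the paper's two specific variants; the function names, step size, and toy least-squares problem are all hypothetical.

```python
import numpy as np

def adagrad_random_shuffling(grad_fn, x0, n, epochs=10, lr=0.1, eps=1e-8):
    """Illustrative sketch: one AdaGrad-style update per component function,
    visiting the n components in a freshly shuffled order each epoch.

    grad_fn(x, i) should return the gradient of the i-th component at x.
    """
    x = np.array(x0, dtype=float)
    accum = np.zeros_like(x)                  # running sum of squared gradients
    for _ in range(epochs):
        perm = np.random.permutation(n)       # random shuffling: new order each epoch
        for i in perm:
            g = grad_fn(x, i)
            accum += g * g
            x -= lr * g / (np.sqrt(accum) + eps)  # coordinate-wise adaptive step
    return x

# Toy usage (hypothetical): least squares, f_i(x) = (a_i^T x - b_i)^2 / 2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grad = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = adagrad_random_shuffling(grad, np.zeros(5), n=100, epochs=50)
```

The key design point mirrored from the abstract is that each epoch is a full pass over all n components in a random permutation (sampling without replacement), rather than i.i.d. sampling with replacement, while the learning rate adapts per coordinate via the accumulated squared gradients.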
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=DUQAoROV9