Towards Understanding the Role of Adaptive Learning Rates in Powered Stochastic Gradient Descent

TMLR Paper3410 Authors

28 Sept 2024 (modified: 28 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: The use of adaptive learning rate (ALR) techniques in stochastic gradient-based methods has become common practice in machine learning, and is often the default choice when training deep neural networks. Different ALR variants, including AdaGrad, Adam, AMSGrad, and RMSProp, have achieved significant success in improving stochastic optimization. Despite these empirical successes, there is a notable lack of clear understanding of how various ALR techniques affect both the theoretical and empirical behaviors of powered stochastic gradient-based algorithms. Moreover, the impact of existing ALR techniques on common stochastic gradient-based algorithms remains under-explored. To fill this gap, this work develops a novel powered stochastic gradient-based algorithm with generalized adaptive learning rates, coined ADAptive Powered Stochastic Gradient Descent (ADA-PSGD), for nonconvex optimization problems. In particular, we elucidate numerous connections between ADA-PSGD and existing ALR techniques. Moreover, we prove a faster convergence rate for ADA-PSGD on nonconvex optimization problems. Further, we show that ADA-PSGD achieves a gradient evaluation cost of $O\left(n+L^2\|\mathbf{1}\|_p^2 (1-\alpha_1\beta_1)^{-1}\varepsilon^{-2}\right)$ ($\alpha_1\in [0, 1]$ and $\beta_1 \in [0, 1)$) to find an $\varepsilon$-approximate stationary point, which is comparable to the well-known algorithmic lower bound. Finally, we empirically demonstrate that ADA-PSGD leads to greatly improved training across different machine learning tasks. We hope that the robustness of ADA-PSGD to crucial hyper-parameters will spur interest from both researchers and practitioners.
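To give a concrete sense of the family of updates the abstract refers to, here is a minimal, hedged sketch of a generic powered gradient step combined with an Adam-style adaptive learning rate. The exact ADA-PSGD update rule is defined in the paper itself; the powered-gradient transform `sign(g) * |g|**gamma`, the hyper-parameter names, and the toy quadratic objective below are assumptions made purely for illustration.

```python
# Illustrative sketch only: a generic "powered" SGD step with an
# Adam-style adaptive learning rate. This is NOT the paper's ADA-PSGD
# update; it merely shows how a powered gradient and an adaptive
# learning rate can be combined.
import numpy as np

def powered_adaptive_sgd_step(theta, grad, state, lr=1e-3, gamma=0.5,
                              beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a powered-gradient update with an adaptive learning rate."""
    m, v, t = state
    g_pow = np.sign(grad) * np.abs(grad) ** gamma   # powered gradient (assumed form)
    m = beta1 * m + (1.0 - beta1) * g_pow           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second-moment estimate
    t += 1
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Toy usage on the quadratic objective f(theta) = 0.5 * ||theta||^2.
theta = np.ones(5)
state = (np.zeros(5), np.zeros(5), 0)
for _ in range(200):
    grad = theta                                    # gradient of the toy objective
    theta, state = powered_adaptive_sgd_step(theta, grad, state, lr=0.1)
print(theta)                                        # should be close to zero
```

Setting gamma = 1 in this sketch recovers an Adam-like update, while other choices of gamma rescale large and small gradient components differently, which is the kind of interplay between powering and adaptive learning rates the paper studies.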
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhiyu_Zhang1
Submission Number: 3410