A Theoretical and Empirical Study on the Convergence of Adam with an "Exact" Constant Step Size in Non-Convex Settings

TMLR Paper 3043 Authors

21 Jul 2024 (modified: 12 Oct 2024) · Rejected by TMLR · CC BY 4.0
Abstract: RMSProp and Adam are among the most widely favoured optimization algorithms for neural network training. A key factor in their performance is the choice of step size, which can greatly influence their effectiveness, and the theoretical convergence properties of these algorithms remain a subject of significant interest. This article provides a theoretical analysis of a constant step size version of Adam in non-convex settings and discusses the importance of a fixed step size for Adam's convergence. We derive a constant step size for Adam and offer insights into its convergence in non-convex optimization scenarios. First, we show that deterministic Adam can be affected by rapidly decaying learning rates, such as linear and exponential decay, which are often used to establish tight convergence bounds for Adam; this suggests that these rapidly decaying rates play a crucial role in driving convergence. Building on this observation, we derive a constant step size that depends on the dynamics of the network and the data, ensuring that Adam reaches critical points of smooth, non-convex objectives, with bounds on its running time. We analyze both the deterministic and stochastic versions of Adam and establish sufficient conditions under which the derived constant step size achieves asymptotic convergence of the gradients to zero under minimal assumptions. We conduct experiments to empirically compare Adam's convergence with our proposed constant step size against state-of-the-art step size schedulers on classification tasks, and we demonstrate that our derived constant step size outperforms various state-of-the-art learning rate schedulers and a range of other constant step sizes in reducing gradient norms. Our empirical results also indicate that, although Adam accumulates a few past gradients, the key driver of its convergence is the use of non-increasing step sizes.
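As a rough illustration of the setup described in the abstract, the sketch below runs PyTorch's Adam with a single fixed learning rate chosen from a crude finite-difference estimate of the loss's gradient-Lipschitz constant. This is not the paper's derivation: the toy data, the perturbation-based estimate `L_hat`, and the 1/L scaling of the step size are all illustrative assumptions standing in for the data- and network-dependent constant derived in the paper.

```python
# Hypothetical sketch: Adam with a constant step size set from a rough
# smoothness estimate of the loss (not the constant derived in the paper).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and a small classifier (placeholders for the paper's classification tasks).
X = torch.randn(256, 20)
y = (X[:, 0] > 0).long()
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def grad_vector():
    """Return the flattened gradient of the loss at the current parameters."""
    model.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

# Crude local Lipschitz estimate of the gradient: ||g(w + d) - g(w)|| / ||d||
# for a small random perturbation d. Purely illustrative.
w0 = [p.detach().clone() for p in model.parameters()]
g0 = grad_vector()
eps = 1e-3
with torch.no_grad():
    perturb = [eps * torch.randn_like(p) for p in model.parameters()]
    for p, d in zip(model.parameters(), perturb):
        p.add_(d)
g1 = grad_vector()
with torch.no_grad():
    for p, w in zip(model.parameters(), w0):
        p.copy_(w)
d_norm = torch.sqrt(sum((d ** 2).sum() for d in perturb))
L_hat = ((g1 - g0).norm() / d_norm).item()

# Constant step size proportional to 1/L_hat: a common smooth-optimization
# heuristic used here as a stand-in; the paper's exact constant differs.
alpha = 1.0 / L_hat
optimizer = torch.optim.Adam(model.parameters(), lr=alpha)  # fixed lr, no scheduler

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        gnorm = torch.cat([p.grad.reshape(-1) for p in model.parameters()]).norm()
        print(f"step {step:3d}  loss {loss.item():.4f}  ||grad|| {gnorm:.4f}")
```

Printing the gradient norm alongside the loss mirrors the paper's evaluation criterion, since the claimed advantage of the constant step size is measured in terms of reducing gradient norms rather than training loss alone.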
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank all the reviewers for providing valuable feedback. Following their suggestions and comments, we have extensively revised our manuscript; in particular, we have performed extensive numerical experiments and provided justifications to address the reviewers' comments. Our main additions in the revised manuscript are:
1. We have added a separate section in the Appendix (Section C.1) analyzing and comparing our Lipschitz estimation method with other efficient Lipschitz estimation methods.
2. We have included two new sections studying the effect of $T$ and $\rho$ on convergence with our step size.
We also provide in-line responses to each reviewer addressing the changes they suggested.
Assigned Action Editor: ~Lijun_Zhang1
Submission Number: 3043