Provable Adaptivity in Adam

Published: 01 Feb 2023, Last Modified: 12 Mar 2024. Submitted to ICLR 2023.
Keywords: Benefit of Adam, convergence, SGD, optimization
Abstract: Adaptive Moment Estimation (Adam) has been observed to converge faster than stochastic gradient descent (SGD) in practice. However, this advantage has not been theoretically characterized -- the existing convergence rates of Adam are no better than those of SGD. We attribute this mismatch between theory and practice to a commonly used assumption: that the gradient is globally Lipschitz continuous (the $L$-smooth condition). Specifically, compared to SGD, Adam adaptively chooses a learning rate better suited to the local gradient Lipschitz constant (the local smoothness). This effect becomes prominent when the local smoothness varies drastically across the domain. In this paper, we analyze the convergence of Adam under the $(L_0,L_1)$-smooth condition, which allows the gradient Lipschitz constant to change with the gradient norm. This condition has been empirically verified to be more realistic for deep neural networks than the $L$-smooth condition \citep{zhang2019gradient}. Under the $(L_0,L_1)$-smooth condition, we establish the convergence of Adam with practical hyperparameters. We therefore argue that Adam can adapt to this local smoothness, justifying Adam's \emph{adaptivity}. In contrast, SGD can be arbitrarily slow under the same condition. Our result sheds light on the benefit of adaptive gradient methods over non-adaptive ones.
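For reference, the $(L_0,L_1)$-smooth condition of \citep{zhang2019gradient} cited in the abstract is commonly stated, for twice-differentiable $f$, as the gradient-norm-dependent Hessian bound
$$\|\nabla^2 f(x)\| \le L_0 + L_1\,\|\nabla f(x)\| \quad \text{for all } x,$$
which recovers the usual $L$-smooth condition when $L_1 = 0$.

To make the adaptivity claim concrete, below is a minimal sketch of a single Adam update (Kingma & Ba, 2015) in NumPy; the hyperparameter values are illustrative defaults, not the ones analyzed in the submission. The effective per-coordinate step size $\eta / (\sqrt{\hat v_t} + \epsilon)$ shrinks where gradients, and hence (under $(L_0,L_1)$-smoothness) the local smoothness, are large.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; illustrative sketch only."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive effective step size
    return w, m, v
```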
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)
TL;DR: We explain why Adam is faster than SGD through convergence analysis.
Community Implementations: [5 code implementations](https://www.catalyzex.com/paper/arxiv:2208.09900/code)