Towards Understanding Convergence and Generalization of AdamW

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. ICLR 2023 Conference Withdrawn Submission.
Keywords: deep learning optimization, network optimizer
TL;DR: We prove the convergence of AdamW and justify its generalization superiority over both Adam and its $\ell_2$-regularized variant.
Abstract: AdamW modifies vanilla Adam by decaying the network weights at each training iteration, and shows remarkable generalization superiority over Adam and its $\ell_2$-regularized variant. In the context of adaptive gradient algorithms (\eg~Adam), the decoupled weight decay in AdamW differs from the widely used $\ell_2$-regularizer: the former does not affect the optimization steps, whereas the latter changes the first- and second-order gradient moments and thus the optimization steps. Despite AdamW's great success on both vision transformers and CNNs, a theoretical understanding of its convergence behavior and of its generalization improvement over ($\ell_2$-regularized) Adam is still absent. To address this issue, we prove the convergence of AdamW and justify its generalization advantages over Adam and its $\ell_2$-regularized version. Specifically, AdamW provably converges, but it minimizes a dynamically regularized loss that combines the vanilla loss with a dynamic regularization induced by the decoupled weight decay, which explains why its behavior differs from that of Adam and its $\ell_2$-regularized version. Moreover, on both general nonconvex problems and P\L-conditioned problems, we establish the stochastic gradient complexity of AdamW to find a stationary point. This complexity also applies to Adam and its $\ell_{2}$-regularized variant, and indeed improves their previously known complexity, especially for modern over-parametrized networks. Besides, we theoretically show that AdamW often enjoys a smaller generalization error bound than both Adam and its $\ell_2$-regularized variant from a Bayesian posterior perspective. This result, for the first time, explicitly reveals the benefits of the unique decoupled weight decay in AdamW. We hope the theoretical results in this work will motivate researchers to propose novel optimizers with faster convergence and better generalization. Experimental results corroborate our theoretical implications.
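To make the distinction in the abstract concrete, below is a minimal NumPy sketch (not the authors' code) contrasting a single update step of $\ell_2$-regularized Adam with AdamW's decoupled weight decay, following the standard Adam/AdamW update rules; the hyperparameter defaults are illustrative only.

```python
# Minimal sketch: l2-regularized Adam vs. AdamW (decoupled weight decay).
# Illustrative only; hyperparameter defaults are assumptions, not the paper's settings.
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """l2 regularization: the decay term is folded into the gradient,
    so it also enters the first- and second-order moment estimates."""
    g = grad + weight_decay * w                      # decay affects the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """Decoupled weight decay: moments are computed from the raw gradient;
    the decay is applied directly to the weights, outside the adaptive step."""
    m = beta1 * m + (1 - beta1) * grad               # moments unaffected by decay
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

The only difference is where the decay enters: in the $\ell_2$ version it is added to the gradient and therefore reshapes the adaptive moments, while in AdamW it is applied directly to the weights, leaving the optimization step itself unchanged.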
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning