Adam-family Methods with Decoupled Weight Decay in Deep Learning

TMLR Paper3634 Authors

06 Nov 2024 (modified: 04 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural networks with weight decay. Motivated by AdamW, we propose a novel framework for Adam-family methods with decoupled weight decay. Within our framework, the estimators for the first-order and second-order moments of the stochastic subgradients are updated independently of the weight decay term. Under mild assumptions and with non-diminishing stepsizes for updating the primary optimization variables, we establish the convergence properties of our framework. In addition, we show that the framework encompasses a wide variety of well-known Adam-family methods, thereby offering convergence guarantees for these methods in the training of nonsmooth neural networks. More importantly, compared to existing results on the choice of parameters for the moment terms in Adam, our framework allows greater flexibility in these parameters. As a practical application of the framework, we propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD) and establish its convergence properties under mild conditions. Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW in terms of both generalization performance and efficiency.
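To make the notion of decoupled weight decay concrete, the following is a minimal sketch of an Adam-style step in which the moment estimators are built from the stochastic (sub)gradient alone and the weight decay acts directly on the parameters, in the spirit of AdamW and the AdamD method described above. The function name, hyperparameter defaults, and update layout are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adam_decoupled_wd_step(x, g, m, v, t, lr=1e-3, beta1=0.9,
                           beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """One illustrative update step.

    x : parameters, g : stochastic (sub)gradient at x,
    m, v : first- and second-moment estimates, t : step count (>= 1).
    """
    # Moment estimators are updated from g only; the weight decay term
    # does not enter m or v (this is the "decoupled" part).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the parameters directly,
    # outside the adaptive preconditioning.
    x = x - lr * weight_decay * x
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

By contrast, a "coupled" (L2-regularized) variant would add `weight_decay * x` to `g` before forming `m` and `v`, so the decay would be rescaled by the adaptive term; the decoupled form avoids that interaction.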
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Sebastian_U_Stich1
Submission Number: 3634