Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms

Published: 01 Feb 2023, Last Modified: 02 Mar 2023, ICLR 2023 notable top 5%, Readers: Everyone
Keywords: Optimization acceleration in deep learning, network optimizers, deep learning optimizer, deep learning algorithm
TL;DR: We propose a new and general Weight-decay-Integrated Nesterov acceleration (Win) for adaptive algorithms to speed up their convergence, and analyze the convergence of the accelerated algorithms to justify their superiority over the non-accelerated counterparts.
Abstract: Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration which combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-like acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam, which uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent derivation of the acceleration could provide insights for other accelerated methods and their integration into adaptive algorithms. In addition, we prove the convergence of Win-accelerated adaptive algorithms and, taking AdamW and Adam as examples, justify their convergence superiority over their non-accelerated counterparts. Experimental results show that Win-accelerated AdamW, Adam, LAMB and SGD converge faster and perform better than their non-accelerated counterparts on vision classification and language modeling tasks with both CNN and Transformer backbones. We hope Win can become a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released.
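The conservative step, the reckless step, and their linear combination described in the abstract can be illustrated with a minimal sketch on a toy quadratic. Everything in it is an assumption made for illustration, not the authors' released code: the function name `win_adamw_sketch`, the parameter `reckless_ratio`, the hyperparameter values, the Adam-style bias correction, and the combination weights. The weights are obtained here by minimizing a linearized loss plus two proximal terms and an L2 penalty.

```python
import math


def win_adamw_sketch(grad_fn, x0, steps=200, lr=0.05, reckless_ratio=2.0,
                     weight_decay=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Toy Win-style AdamW loop on a list of floats (illustrative assumption)."""
    x = list(x0)            # iterate updated with the conservative step
    y = list(x0)            # accelerated iterate; gradients are evaluated here
    m = [0.0] * len(x)      # first-moment estimate
    v = [0.0] * len(x)      # second-moment estimate
    lr_bar = reckless_ratio * lr  # larger ("reckless") step size, assumed ratio
    # Normalizer of the assumed closed-form combination of the two updates.
    tau = 1.0 / (lr + lr_bar + weight_decay * lr * lr_bar)
    for k in range(1, steps + 1):
        g = grad_fn(y)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** k)
            v_hat = v[i] / (1.0 - beta2 ** k)
            u = m_hat / (math.sqrt(v_hat) + eps)  # adaptive update direction
            # Conservative step: weight decay enters through the proximal
            # denominator instead of being added to the gradient.
            x[i] = (x[i] - lr * u) / (1.0 + lr * weight_decay)
            # Reckless step from y, linearly combined with the new x.
            y[i] = tau * (lr_bar * x[i] + lr * (y[i] - lr_bar * u))
    return x, y


# Hypothetical test problem: f(p) = 0.5 * sum(a_i * (p_i - c_i)^2).
a = [1.0, 10.0]
c = [1.0, 0.5]


def loss(p):
    return 0.5 * sum(ai * (pi - ci) ** 2 for ai, pi, ci in zip(a, p, c))


def grad(p):
    return [ai * (pi - ci) for ai, pi, ci in zip(a, p, c)]


x_start = [3.0, -2.0]
x_final, y_final = win_adamw_sketch(grad, x_start)
print(loss(x_start), loss(x_final), loss(y_final))
```

Setting `weight_decay=0.0` in this sketch gives an Adam-style variant without the decay term. The sketch says nothing about the convergence rates proved in the paper.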
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning