Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

Ran Tian; Ankur P Parikh

Published: 01 Feb 2023, Last Modified: 13 Feb 2023

Submitted to ICLR 2023

Readers: Everyone

Keywords: optimization, asymptotic behavior of stochastic optimization, learning-rate decay, weight decay, language model pre-training, Transformer pre-training

TL;DR: An optimizer that consistently converges faster than AdamW when pre-training Transformer variants, needing at most 70% of the training steps.

Abstract: We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos is that it leverages model-specific information to determine the initial learning rate and the decay schedules. When used for pre-training BERT variants and T5, Amos consistently converges faster than the state-of-the-art settings of AdamW, achieving better validation loss in at most 70% of the training steps and time, while requiring at most 51% of the memory for slot variables.

Closest Area: Optimization (e.g., convex and non-convex optimization)
