On the Variance of the Adaptive Learning Rate and Beyond

Published: 20 Dec 2019, Last Modified: 22 Oct 2023
ICLR 2020 Conference Blind Submission
Readers: Everyone
Keywords: warmup, adam, adaptive learning rate, variance
TL;DR: If warmup is the answer, what is the question?
Abstract: The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate -- its variance is problematically large in the early stage, and presume warmup works as a variance reduction technique. We provide both empirical and theoretical evidence to verify our hypothesis. We further propose Rectified Adam (RAdam), a novel variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam.
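The abstract does not spell out the rectification term, so the following is only a minimal sketch of how it is commonly stated for RAdam, not the authors' reference code. The function name `radam_rectification` and the `threshold` parameter are hypothetical; the default cutoff of 4 is an assumption based on the published algorithm, and some implementations use a more conservative value.

```python
import math


def radam_rectification(step, beta2=0.999, threshold=4.0):
    """Sketch of the variance rectification factor r_t for a given step.

    Returns None when the approximated SMA length rho_t is at or below
    `threshold`, i.e. when the variance of the adaptive learning rate is
    treated as intractable and the adaptive term would be skipped.
    """
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= threshold:
        return None
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)


if __name__ == "__main__":
    for t in (1, 5, 10, 100, 1000, 100000):
        print(t, radam_rectification(t))
```

Under these assumptions the factor is undefined for the first few steps, starts well below 1, and approaches 1 as training proceeds, which mirrors the shape of a warmup schedule.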
Code: https://github.com/LiyuanLucasLiu/RAdam
Community Implementations: [20 code implementations](https://www.catalyzex.com/paper/arxiv:1908.03265/code) (CatalyzeX)
17 Replies