Keywords: Stochastic optimization
Abstract: We investigate stochastic optimization under weaker assumptions on the noise distribution than those used in standard analyses. Our assumptions are motivated by empirical observations in neural network training. In particular, standard results on optimal convergence rates for stochastic optimization assume either that there exists a uniform bound on the moments of the gradient noise or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms used in neural network training, where the noise level in stochastic gradients can even increase with time. We address this nonstationary behavior of the noise by analyzing convergence rates of stochastic gradient methods subject to a changing second moment (or variance) of the stochastic oracle. When the noise variation is known, we show that it is always beneficial to adapt the step size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp~\citep{tieleman2012lecture}. Consequently, our results reveal why adaptive step size methods can outperform SGD while still enjoying theoretical guarantees.
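To make the abstract's mechanism concrete, here is a minimal sketch of SGD driven by an RMSProp-style online second-moment estimate, which shrinks the effective step size when the gradient noise grows. This is an illustrative assumption-laden sketch (the oracle `grad_fn`, the quadratic test problem, and all hyperparameters are hypothetical), not the paper's exact algorithm.

```python
import numpy as np

def rmsprop_style_sgd(grad_fn, x0, lr=1e-2, beta=0.99, eps=1e-8, steps=1000):
    """SGD with an RMSProp-style online estimate of the gradient second moment.

    grad_fn(x, t) is assumed to return a stochastic gradient whose noise level
    may change with t (the nonstationary setting described in the abstract).
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # running estimate of the second moment E[g^2]
    for t in range(steps):
        g = grad_fn(x, t)
        v = beta * v + (1.0 - beta) * g**2   # online second-moment estimate
        x = x - lr * g / (np.sqrt(v) + eps)  # step size shrinks as noise grows
    return x

# Usage: minimize a simple quadratic whose gradient noise increases with t.
if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def noisy_grad(x, t):
        return 2.0 * x + (0.1 + 1e-3 * t) * rng.standard_normal(x.shape)

    x_final = rmsprop_style_sgd(noisy_grad, x0=np.ones(5), steps=5000)
    print(x_final)
```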
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We prove that moment estimation can accelerate SGD under the nonstationary noise setting.
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=VWedOAuzG