Keywords: Adam, Optimization, Normalized Gradient Methods
TL;DR: We present a robust framework for analyzing normalized first-order gradient methods such as Adam, obtaining convergence-rate guarantees for both full-batch and stochastic Adam.
Abstract: This paper presents a fresh mathematical perspective on Adam, whose empirical success stands in stark contrast to its analytic intractability. We derive Adam via duality, showing that many of its design choices, such as coordinate-wise normalization and exponential moving averages, emerge naturally from a unified framework. Using this framework, we first analyze two normalized gradient descent methods on linearly separable data, which favor solutions with differing geometries: SignGD, which converges to an $\ell_{\infty}$-max-margin classifier at a rate of $\mathcal{O}(\frac{1}{\sqrt{t}})$, and \emph{Normalized GD}, which instead converges to an $\ell_2$-max-margin classifier at a rate of $\mathcal{O}(\frac{1}{t})$, vastly improving upon the $\mathcal{O}(\frac{1}{\ln t})$ rate of gradient descent. Next, we show that Adam, which replaces the raw gradients in SignGD with exponential moving averages, achieves margin maximization at a rate of $\mathcal{O}(\frac{1}{\sqrt{t}})$, whereas prior work requires additional assumptions and attains only an $\mathcal{O}(\frac{1}{t^{1/3}})$ rate. In the stochastic setting, this duality approach yields the first high-probability convergence guarantee for low test error under standard empirical choices of the momentum factors $0<\beta_1<\beta_2<1$, improving upon prior work that establishes only in-expectation bounds at a slower rate of $\mathcal{O}(\frac{1}{t^{1/4}})$.
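For reference, below is a minimal sketch of the standard update rules behind the methods named in the abstract, written in the usual notation with step size $\eta$ and momentum factors $\beta_1,\beta_2$; the paper's exact variants (e.g., bias correction or an $\epsilon$ term in the denominator) may differ.

\begin{align*}
  \text{SignGD:}        \quad & w_{t+1} = w_t - \eta \,\operatorname{sign}\!\big(\nabla L(w_t)\big), \\
  \text{Normalized GD:} \quad & w_{t+1} = w_t - \eta \,\frac{\nabla L(w_t)}{\|\nabla L(w_t)\|_2}, \\
  \text{Adam (no bias correction, } \epsilon = 0\text{):} \quad
    & m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\nabla L(w_t), \\
    & v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\big(\nabla L(w_t)\big)^{2}, \\
    & w_{t+1} = w_t - \eta \,\frac{m_t}{\sqrt{v_t}} \quad \text{(all operations coordinate-wise)}.
\end{align*}

In this schematic form, setting $\beta_1=\beta_2=0$ reduces Adam's update to $-\eta\,\nabla L(w_t)/|\nabla L(w_t)| = -\eta\,\operatorname{sign}(\nabla L(w_t))$, i.e., SignGD, which is the connection the abstract refers to when describing Adam as SignGD with exponential moving averages.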
Primary Area: learning theory
Submission Number: 21640