Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients

Lukas Balles; Philipp Hennig

15 Feb 2018 (modified: 22 Jun 2025) | ICLR 2018 Conference Blind Submission | Readers: Everyone

Abstract: The ADAM optimizer is exceedingly popular in the deep learning community. Often it works very well, sometimes it doesn’t. Why? We interpret ADAM as a combination of two aspects: for each weight, the update direction is determined by the sign of the stochastic gradient, whereas the update magnitude is solely determined by an estimate of its relative variance. We disentangle these two aspects and analyze them in isolation, shedding light on ADAM’s inner workings. Transferring the “variance adaptation” to momentum-SGD gives rise to a novel method, completing the practitioner’s toolbox for problems where ADAM fails.
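The sign/magnitude reading in the abstract can be checked numerically. The sketch below is an illustration, not the authors' code: it assumes the usual Adam moving averages `m` and `v` with bias correction, omits the small constant epsilon, and takes `(v - m^2) / m^2` as the relative variance estimate (one natural reading of the abstract, stated here as an assumption). The function names and the toy gradient sequence are hypothetical. Under those assumptions, `m / sqrt(v)` is algebraically identical to `sign(m) / sqrt(1 + eta2)`.

```python
import math


def moving_averages(grads, beta1=0.9, beta2=0.999):
    """Bias-corrected exponential moving averages of g and g^2 for one weight."""
    m = 0.0
    v = 0.0
    t = 0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction as in standard Adam (requires at least one gradient).
    return m / (1.0 - beta1 ** t), v / (1.0 - beta2 ** t)


def adam_direction(m, v):
    """Adam's per-weight step direction m / sqrt(v), with epsilon omitted."""
    return m / math.sqrt(v)


def sign_times_magnitude(m, v):
    """The same quantity written as sign(m) * 1 / sqrt(1 + eta2).

    eta2 is the assumed relative variance estimate; it requires m != 0.
    """
    eta2 = (v - m * m) / (m * m)
    sign = math.copysign(1.0, m)
    return sign / math.sqrt(1.0 + eta2)


# Hypothetical stochastic gradients for a single weight.
grads = [0.5, -0.2, 0.8, 0.3, 0.6]
m_hat, v_hat = moving_averages(grads)
step_adam = adam_direction(m_hat, v_hat)
step_split = sign_times_magnitude(m_hat, v_hat)
```

In this form the sign carries the direction, and the factor `1 / sqrt(1 + eta2)` shrinks the step as the estimated relative variance grows, which is the "variance adaptation" the abstract refers to.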

TL;DR: Analyzing the popular Adam optimizer

Keywords: Stochastic Optimization, Deep Learning

