Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients

15 Feb 2018 (modified: 21 Apr 2024) · ICLR 2018 Conference Blind Submission
Abstract: The ADAM optimizer is exceedingly popular in the deep learning community. Often it works very well; sometimes it doesn't. Why? We interpret ADAM as a combination of two aspects: for each weight, the update direction is determined by the sign of the stochastic gradient, whereas the update magnitude is solely determined by an estimate of its relative variance. We disentangle these two aspects and analyze them in isolation, shedding light on ADAM's inner workings. Transferring the "variance adaptation" to momentum-SGD gives rise to a novel method, completing the practitioner's toolbox for problems where ADAM fails.
TL;DR: Analyzing the popular Adam optimizer
Keywords: Stochastic Optimization, Deep Learning
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:1705.07774/code)
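
The decomposition described in the abstract can be made concrete with a minimal sketch (our illustration, not the authors' code): the per-weight Adam step -lr * m_hat / (sqrt(v_hat) + eps) factors exactly into a sign term and a magnitude term, and because v_hat estimates E[g^2] = m^2 + sigma^2, the magnitude shrinks for weights whose gradient has high relative variance.

```python
import numpy as np

def adam_step_decomposed(m_hat, v_hat, lr=1e-3, eps=1e-8):
    """Hypothetical helper illustrating the abstract's reading of Adam:
    direction = sign of the first-moment estimate m_hat,
    magnitude = a variance-adaptation factor derived from v_hat."""
    direction = np.sign(m_hat)                        # sign of the (smoothed) stochastic gradient
    # v_hat estimates E[g^2] = m^2 + sigma^2, so |m|/sqrt(v) behaves like
    # 1/sqrt(1 + sigma^2/m^2): the step shrinks with the gradient's relative variance.
    magnitude = np.abs(m_hat) / (np.sqrt(v_hat) + eps)
    # Since sign(m) * |m| = m, this equals the usual Adam update -lr * m_hat / (sqrt(v_hat) + eps).
    return -lr * direction * magnitude
```

In the paper's terms, applying only the magnitude factor to a momentum-SGD direction is the "variance adaptation" that yields the proposed variant.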