- Abstract: The ADAM optimizer is exceedingly popular in the deep learning community. Often it works very well, sometimes it doesn’t. Why? We interpret ADAM as a combination of two aspects: for each weight, the update direction is determined by the sign of the stochastic gradient, whereas the update magnitude is solely determined by an estimate of its relative variance. We disentangle these two aspects and analyze them in isolation, shedding light on ADAM ’s inner workings. Transferring the "variance adaptation” to momentum- SGD gives rise to a novel method, completing the practitioner’s toolbox for problems where ADAM fails.
- TL;DR: Analyzing the popular Adam optimizer
- Keywords: Stochastic Optimization, Deep Learning