Decaying momentum helps neural network training

25 Sept 2019 (modified: 28 May 2023) · ICLR 2020 Conference Blind Submission · Readers: Everyone
Community Implementations: 2 code implementations (via CatalyzeX)
Keywords: sgd, momentum, adam, optimization, deep learning
TL;DR: We introduce a momentum decay rule which significantly improves the performance of Adam and momentum SGD
Abstract: Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training, notably competitive with momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD rivals momentum SGD with learning rate decay, and in many cases leads to improved performance. Demon is trivial to implement and incurs limited extra computational overhead compared to its vanilla counterparts.
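The abstract's idea — decaying the total contribution of a gradient to all future updates — can be sketched in a few lines. A minimal sketch, assuming the published Demon schedule (a gradient's cumulative influence is proportional to β/(1−β), and Demon decays this quantity linearly to zero over T steps); the learning rate, β_init, and the toy quadratic objective here are illustrative choices, not from the abstract:

```python
def demon_beta(t, T, beta_init=0.9):
    """Demon momentum schedule (assumed form): decay beta so that
    beta/(1-beta), a gradient's total contribution to all future
    updates, falls linearly from its initial value to zero over T steps."""
    z = 1.0 - t / T
    return beta_init * z / ((1.0 - beta_init) + beta_init * z)

# Toy usage: momentum SGD with Demon on f(x) = x^2 (illustrative setup).
x, v, lr, T = 5.0, 0.0, 0.1, 100
for t in range(T):
    grad = 2.0 * x                 # gradient of f(x) = x^2
    beta = demon_beta(t, T)        # decayed momentum coefficient
    v = beta * v + grad            # heavy-ball style velocity update
    x = x - lr * v
```

Note that the schedule starts at β_init (t = 0) and reaches exactly zero at t = T, so training ends as plain SGD.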