## Decaying momentum helps neural network training

25 Sep 2019 (modified: 24 Dec 2019) · ICLR 2020 Conference Blind Submission · Readers: Everyone
• Keywords: sgd, momentum, adam, optimization, deep learning
• TL;DR: We introduce a momentum decay rule which significantly improves the performance of Adam and momentum SGD
• Abstract: Momentum is a simple and popular technique for gradient-based optimizers in deep learning. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam significantly improves training and makes it competitive with momentum SGD with learning rate decay, even in settings where adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD rivals momentum SGD with learning rate decay, and in many cases leads to improved performance. Demon is trivial to implement and incurs limited extra computational overhead compared with its vanilla counterparts.
• Code: https://gofile.io/?c=rRFyJF
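
The abstract does not spell out the decay rule, so the following is only a sketch of one way to read "decaying the total contribution of a gradient to all future updates". With a constant momentum coefficient β, a gradient is reused with weights β, β², β³, ..., which sum to β / (1 − β). The sketch assumes this sum is decayed linearly to zero over `total_steps` iterations and solves for the per-step coefficient. The names `demon_beta` and `sgd_momentum_demon`, the linear schedule, and all default values are illustrative assumptions, not taken from the submission or its linked code.

```python
def demon_beta(step, total_steps, beta_init=0.9):
    """Momentum coefficient at `step` under an assumed linear decay.

    Assumption: beta_t / (1 - beta_t) = (1 - step / total_steps)
                                        * beta_init / (1 - beta_init).
    Solving for beta_t gives the expression returned below.
    """
    remaining = 1.0 - step / total_steps  # fraction of training left
    return beta_init * remaining / ((1.0 - beta_init) + beta_init * remaining)


def sgd_momentum_demon(grad_fn, x0, lr=0.1, total_steps=100, beta_init=0.9):
    """Momentum SGD on a scalar parameter with a decaying momentum coefficient.

    `grad_fn` maps the current parameter to its gradient (hypothetical helper
    supplied by the caller). The learning rate is held constant.
    """
    x, velocity = x0, 0.0
    for step in range(total_steps):
        beta = demon_beta(step, total_steps, beta_init)
        velocity = beta * velocity + grad_fn(x)  # accumulate past gradients
        x = x - lr * velocity                    # parameter update
    return x


if __name__ == "__main__":
    # Toy problem: minimize f(x) = x^2, whose gradient is 2x.
    final_x = sgd_momentum_demon(lambda x: 2.0 * x, x0=5.0)
    print(final_x)
```

Under this assumed schedule the coefficient starts at `beta_init` and reaches zero at the final step, so the optimizer behaves like momentum SGD early in training and like plain SGD at the end, without changing the learning rate.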