Decaying momentum helps neural network training

John Chen; Anastasios Kyrillidis

25 Sept 2019 (modified: 22 Jun 2025), ICLR 2020 Conference Blind Submission. Readers: Everyone

Keywords: sgd, momentum, adam, optimization, deep learning

TL;DR: We introduce a momentum decay rule that significantly improves the performance of Adam and momentum SGD.

Abstract: Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training and makes it competitive with momentum SGD with learning rate decay, even in settings where adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD rivals momentum SGD with learning rate decay, and in many cases leads to improved performance. Demon is trivial to implement and incurs little computational overhead beyond its vanilla counterparts.
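The motivation in the abstract can be made concrete with a short sketch. In momentum SGD with a fixed coefficient β, a gradient's contribution to all later updates is proportional to β + β² + ... = β/(1 − β). The abstract does not state the schedule itself, so the code below assumes that this quantity is decayed linearly to zero over the training run; the function name `demon_beta` and its parameters are hypothetical and chosen for illustration only.

```python
def demon_beta(step: int, total_steps: int, beta_init: float = 0.9) -> float:
    """Momentum coefficient at `step` under an assumed linear decay of
    beta / (1 - beta), the total future contribution of one gradient.

    This is an illustrative sketch, not a confirmed copy of the authors' code.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= beta_init < 1.0:
        raise ValueError("beta_init must lie in [0, 1)")

    # Fraction of training remaining, clamped so steps past the end give 0.
    frac = max(0.0, 1.0 - step / total_steps)

    # Solve beta_t / (1 - beta_t) = frac * beta_init / (1 - beta_init) for beta_t.
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)


if __name__ == "__main__":
    # Print the coefficient at a few points of a 100-step run.
    for t in (0, 25, 50, 75, 100):
        print(t, round(demon_beta(t, 100), 4))
```

Under this assumption the coefficient starts at `beta_init`, stays close to it for most of training, and drops quickly to zero near the end, so it could be passed as the per-step momentum value to an otherwise unchanged momentum SGD or Adam update.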

Code: https://gofile.io/?c=rRFyJF

Community Implementations: [2 code implementations (CatalyzeX)](https://www.catalyzex.com/paper/decaying-momentum-helps-neural-network/code)
