Aggregated Momentum: Stability Through Passive Damping

James Lucas; Shengyang Sun; Richard Zemel; Roger  Grosse

Aggregated Momentum: Stability Through Passive Damping

James Lucas, Shengyang Sun, Richard Zemel, Roger Grosse

Published: 21 Dec 2018, Last Modified: 22 Jun 2025ICLR 2019 Conference Blind SubmissionReaders: Everyone

Abstract: Momentum is a simple and widely used trick which allows gradient-based optimizers to pick up speed along low curvature directions. Its performance depends crucially on a damping coefficient. Largecamping coefficients can potentially deliver much larger speedups, but are prone to oscillations and instability; hence one typically resorts to small values such as 0.5 or 0.9. We propose Aggregated Momentum (AggMo), a variant of momentum which combines multiple velocity vectors with different damping coefficients. AggMo is trivial to implement, but significantly dampens oscillations, enabling it to remain stable even for aggressive damping coefficients such as 0.999. We reinterpret Nesterov's accelerated gradient descent as a special case of AggMo and analyze rates of convergence for quadratic objectives. Empirically, we find that AggMo is a suitable drop-in replacement for other momentum methods, and frequently delivers faster convergence with little to no tuning.

Keywords: momentum, optimization, deep learning, neural networks

TL;DR: We introduce a simple variant of momentum optimization which is able to outperform classical momentum, Nesterov, and Adam on deep learning tasks with minimal hyperparameter tuning.

Data: [CIFAR-10](https://paperswithcode.com/dataset/cifar-10), [CIFAR-100](https://paperswithcode.com/dataset/cifar-100), [MNIST](https://paperswithcode.com/dataset/mnist), [Penn Treebank](https://paperswithcode.com/dataset/penn-treebank)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/aggregated-momentum-stability-through-passive/code)

11 Replies

Loading