Keywords: sgd, momentum, nesterov, adam, qhm, qhadam, optimization
TL;DR: Mix plain SGD and momentum (or do something similar with Adam) for great profit.
Abstract: Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. Code is immediately available.
Code: [![github](/images/github_icon.svg) facebookresearch/qhoptim](https://github.com/facebookresearch/qhoptim) + [![Papers with Code](/images/pwc_icon.svg) 1 community implementation](https://paperswithcode.com/paper/?openreview=S1fUpoR5FQ)
Data: [CIFAR-10](https://paperswithcode.com/dataset/cifar-10), [MuJoCo](https://paperswithcode.com/dataset/mujoco), [WikiText-103](https://paperswithcode.com/dataset/wikitext-103), [WikiText-2](https://paperswithcode.com/dataset/wikitext-2)
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/arxiv:1810.06801/code)