- Keywords: momentum, variance reduction
- Abstract: Stochastic gradient descent with momentum (SGD+M) is widely used to empirically improve the convergence behavior and the generalization performance of plain stochastic gradient descent (SGD) in the training of deep learning models, but our theoretical understanding of SGD+M is still very limited. Contrary to the conventional wisdom that sees the momentum in SGD+M as a way to extrapolate the iterates, this work provides an alternative view that interprets the momentum in SGD+M as a (biased) variance-reduced stochastic gradient. We rigorously prove that the momentum in SGD+M converges to the true gradient, with its variance vanishing asymptotically. This reduced variance in gradient estimation provides better convergence behavior and opens up a different path for future analyses of momentum methods. Because the variance reduction in the momentum requires neither a finite-sum structure in the objective function nor complicated hyperparameters to tune, SGD+M works on complicated deep learning models, possibly involving data augmentation and dropout, on which many other variance reduction methods fail.
- One-sentence Summary: We prove that the variance of the momentum term in SGD+M vanishes asymptotically, so SGD+M can be interpreted as a variance-reduced stochastic method without any modification to the algorithm.
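
The variance-reduction view described in the abstract can be illustrated numerically with a minimal sketch. It is not the paper's analysis or experimental setup. It assumes a toy quadratic objective f(x) = 0.5 * ||x||^2, additive Gaussian gradient noise, and the exponential-moving-average form of momentum, m_t = beta * m_{t-1} + (1 - beta) * g_t. The function name and all parameter values are hypothetical, chosen only for illustration. The sketch compares how far the raw stochastic gradient and the momentum buffer each lie from the true gradient.

```python
import numpy as np


def momentum_vs_gradient_error(beta=0.9, lr=0.01, noise_std=1.0,
                               steps=5000, dim=10, seed=0):
    """Compare the squared error of g_t and m_t against the true gradient.

    Assumed toy setting (not from the paper): f(x) = 0.5 * ||x||^2, so the
    true gradient at x is x itself, and the stochastic gradient adds
    Gaussian noise with standard deviation `noise_std`.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    m = np.zeros(dim)  # momentum buffer
    err_g, err_m = [], []
    for _ in range(steps):
        true_grad = x
        g = true_grad + noise_std * rng.normal(size=dim)  # noisy gradient
        # Assumed EMA form of momentum; other normalizations exist.
        m = beta * m + (1.0 - beta) * g
        err_g.append(np.sum((g - true_grad) ** 2))
        err_m.append(np.sum((m - true_grad) ** 2))
        x = x - lr * m  # SGD+M step using the momentum as the direction
    half = steps // 2  # discard the initial transient
    return float(np.mean(err_g[half:])), float(np.mean(err_m[half:]))


if __name__ == "__main__":
    err_g, err_m = momentum_vs_gradient_error()
    print(f"mean squared error of stochastic gradient: {err_g:.3f}")
    print(f"mean squared error of momentum:            {err_m:.3f}")
```

In this toy setting the momentum buffer tracks the true gradient more closely than a single stochastic gradient does. With a constant `beta` and step size, the error shrinks but does not vanish; the asymptotic vanishing stated in the abstract depends on the conditions established in the paper, which this sketch does not reproduce.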