Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization

TMLR Paper360 Authors

12 Aug 2022 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M) by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is tight enough to give precise insights into when SGD+M may outperform SGD, and into which hyper-parameter schedules will work and why. Surprisingly, we show that the commonly used stage-wise schedule does not make sense in SPA form, and we discuss how to fix it. Our theory suggests that momentum is only useful at the early stages of training, and we verify this empirically by showing that dropping momentum after one epoch results in no loss of final test accuracy on CIFAR-10 and ImageNet training.
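As a minimal sketch of the primal-averaging rewriting referred to in the abstract: the heavy-ball update x_{k+1} = x_k - α g_k + β (x_k - x_{k-1}) can be reproduced by a plain stochastic gradient step on an auxiliary iterate z followed by an online average, with the mapping β = 1 - c and α = c·η. The code below is an illustrative example with constant step size η and averaging coefficient c (the paper's SPA form may use time-varying schedules); the function names and the quadratic test problem are hypothetical, chosen only to check the equivalence numerically.

```python
import numpy as np

def spa_step(z, x, grad, eta, c):
    """One step of the stochastic primal averaging (SPA) form.

    z    : auxiliary iterate, updated by a plain (stochastic) gradient step
    x    : averaged iterate, where the gradient is evaluated
    eta  : SPA step size; c : averaging coefficient in (0, 1].
    """
    z_next = z - eta * grad(x)            # gradient step on the auxiliary sequence
    x_next = (1.0 - c) * x + c * z_next   # running weighted average of the z iterates
    return z_next, x_next

def heavy_ball_step(x, x_prev, grad, alpha, beta):
    """One step of SGD with heavy-ball momentum, written in two-point form."""
    x_next = x - alpha * grad(x) + beta * (x - x_prev)
    return x_next, x

if __name__ == "__main__":
    # Hypothetical quadratic objective f(x) = 0.5 * x^T A x, used only to
    # verify that the two parameterizations trace out the same x iterates.
    rng = np.random.default_rng(0)
    A = np.diag(rng.uniform(0.1, 1.0, size=5))
    grad = lambda x: A @ x

    beta, alpha = 0.9, 0.05          # heavy-ball parameters
    c, eta = 1.0 - beta, alpha / (1.0 - beta)  # corresponding SPA parameters

    x0 = rng.standard_normal(5)
    z, x_spa = x0.copy(), x0.copy()
    x_hb, x_hb_prev = x0.copy(), x0.copy()

    for _ in range(50):
        z, x_spa = spa_step(z, x_spa, grad, eta, c)
        x_hb, x_hb_prev = heavy_ball_step(x_hb, x_hb_prev, grad, alpha, beta)

    print("max deviation between SPA and heavy-ball iterates:",
          np.max(np.abs(x_spa - x_hb)))   # should be at numerical precision
```

Under this constant-parameter mapping the two iterate sequences coincide; the paper's analysis and schedule recommendations are phrased in terms of the (η, c) parameterization rather than (α, β).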
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sebastian_U_Stich1
Submission Number: 360