- TL;DR: Amortizing Nesterov's momentum for more robust, lightweight and fast deep learning training.
- Abstract: Stochastic Gradient Descent (SGD) with Nesterov's momentum is a widely used optimizer in deep learning, which is observed to have excellent generalization performance. However, due to the large stochasticity, SGD with Nesterov's momentum is not robust, i.e., its performance may deviate significantly from the expectation. In this work, we propose Amortized Nesterov's Momentum, a special variant of Nesterov's momentum which has more robust iterates, faster convergence in the early stage and higher efficiency. Our experimental results show that this new momentum achieves similar (sometimes better) generalization performance with little-to-no tuning. In the convex case, we provide optimal convergence rates for our new methods and discuss how the theorems explain the empirical results.
- Code: https://gofile.io/?c=e26cUT
- Keywords: momentum, nesterov, optimization, deep learning, neural networks