Abstract: Training deep neural networks--and more recently, large models--demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (**M**ake v**A**riance **R**eduction **S**hine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
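To make the abstract's central idea concrete, below is a minimal NumPy sketch of how a scaled stochastic recursive momentum correction can be combined with an AdamW-style preconditioned update. This is an illustrative sketch, not the paper's exact algorithm: the function name `mars_adamw_step`, the scaling factor `gamma`, the unit-norm clipping, and all hyperparameter values are assumptions made for the example. The defining ingredient shown is that `grad_prev` is the gradient of the *same* mini-batch evaluated at the previous iterate, which is what turns the momentum term into a variance-reducing correction.

```python
import numpy as np

def mars_adamw_step(params, grad_curr, grad_prev, state,
                    lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, weight_decay=0.1, eps=1e-8):
    """One illustrative MARS-style update with AdamW-like preconditioning.

    grad_curr: gradient of the current mini-batch at the current params.
    grad_prev: gradient of the SAME mini-batch at the previous params.
    All hyperparameter names and values are illustrative assumptions.
    """
    # Scaled stochastic recursive momentum: add a scaled gradient-difference
    # correction to the fresh stochastic gradient.
    c = grad_curr + gamma * (beta1 / (1.0 - beta1)) * (grad_curr - grad_prev)

    # Clip the corrected gradient to unit norm to keep the correction stable
    # (one simple way to "scale" the recursive momentum term).
    c_norm = np.linalg.norm(c)
    if c_norm > 1.0:
        c = c / c_norm

    # AdamW-style first/second moment estimates of the corrected gradient.
    state['t'] += 1
    t = state['t']
    state['m'] = beta1 * state['m'] + (1 - beta1) * c
    state['v'] = beta2 * state['v'] + (1 - beta2) * c ** 2
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)

    # Decoupled weight decay plus preconditioned update, as in AdamW.
    return params - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * params)


# Example: one step on a toy problem, reusing the same mini-batch at the
# current and previous parameters to form the correction term.
rng = np.random.default_rng(0)
x_prev = rng.normal(size=4)
x = x_prev - 0.01 * rng.normal(size=4)
batch = rng.normal(size=4)                 # stands in for a data sample
grad = lambda p: p - batch                 # toy stochastic gradient
state = {'m': np.zeros_like(x), 'v': np.zeros_like(x), 't': 0}
x_next = mars_adamw_step(x, grad(x), grad(x_prev), state)
```

The structural cost relative to plain AdamW in this sketch is the extra gradient evaluation of the current batch at the previous parameters (or an approximation of it); that extra evaluation is what supplies the variance-reducing correction before the preconditioner is applied.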
Lay Summary: Training large models requires a lot of computing power and time. Scientists use specialized algorithms called optimizers to make this learning process more efficient. While popular adaptive optimizers like AdamW help by adaptively adjusting the learning step size, they aren't perfect. Another set of techniques, known as variance reduction, aims to make the AI's learning steps more consistent and less erratic, but these techniques haven't worked well for today's giant AI models.
We propose a new framework called MARS (**M**ake v**A**riance **R**eduction **S**hine) that addresses this gap. MARS brings together the benefits of established optimizers (like AdamW, Lion, and Shampoo) with a new, effective way to reduce variance: it uses scaled stochastic recursive momentum to balance variance reduction with adaptive learning, helping the model learn more efficiently with steadier updates. In tests training GPT-2, a well-known large language model, MARS performed significantly better than the widely used AdamW optimizer. This new approach could lead to faster and more efficient training for the next generation of large AI systems.
Link To Code: https://github.com/AGI-Arena/MARS
Primary Area: Optimization
Keywords: variance reduction, adaptive learning, large models
Submission Number: 2440