Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees
TL;DR: We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks.
Abstract: We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, the Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian noise, we show a noise-adapted high-probability convergence guarantee for SN with improved dimensional dependence over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove high-probability convergence rates for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs. 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80\% with minimal additional hyperparameter tuning.
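The following is a minimal, illustrative sketch of the two ideas described above, not the authors' implementation (see the linked repository for that). It applies a Subset-Norm adaptive step size (one AdaGrad-style accumulator per coordinate subset of size roughly $\sqrt{d}$) together with Subspace-Momentum (momentum kept only inside a low-dimensional subspace, plain SGD on the orthogonal complement) to a single flattened parameter vector. The contiguous subset partition, the random orthonormal projection, and the hyperparameter values are illustrative assumptions, not choices taken from the paper.

```python
# Hedged sketch of Subset-Norm (SN) + Subspace-Momentum (SM) for one parameter vector.
# Assumptions (not from the paper): contiguous coordinate subsets, a fixed random
# orthonormal projection P, and the hyperparameters below.
import torch


class SNSMSketch:
    def __init__(self, d, lr=1e-2, eps=1e-8, rank=8, beta=0.9, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.lr, self.eps, self.beta = lr, eps, beta
        # Subset-Norm state: ~sqrt(d) subsets, one accumulator each -> O(sqrt(d)) memory.
        self.subset_size = max(1, int(d ** 0.5))
        self.num_subsets = (d + self.subset_size - 1) // self.subset_size
        self.sn_acc = torch.zeros(self.num_subsets)
        # Subspace-Momentum state: orthonormal basis P (d x rank) and a momentum
        # buffer living only in that rank-dimensional subspace -> O(rank) memory.
        self.P, _ = torch.linalg.qr(torch.randn(d, rank, generator=g))
        self.m = torch.zeros(rank)

    def step(self, param, grad):
        d = grad.numel()
        # Subset-Norm: accumulate squared subset norms and share one adaptive step
        # size per subset (recovers AdaGrad-Norm with 1 subset, AdaGrad-Coordinate with d).
        padded = torch.zeros(self.num_subsets * self.subset_size)
        padded[:d] = grad
        sq_norms = padded.view(self.num_subsets, self.subset_size).pow(2).sum(dim=1)
        self.sn_acc += sq_norms
        scale = 1.0 / (self.sn_acc.sqrt() + self.eps)
        scale_per_coord = scale.repeat_interleave(self.subset_size)[:d]

        # Subspace-Momentum: momentum on the component inside span(P),
        # plain SGD on the orthogonal complement.
        g_sub = self.P.T @ grad
        self.m = self.beta * self.m + (1 - self.beta) * g_sub
        g_orth = grad - self.P @ g_sub
        direction = self.P @ self.m + g_orth

        param -= self.lr * scale_per_coord * direction
        return param
```

As a design note, the sketch stores only the per-subset accumulator and the rank-dimensional momentum buffer, rather than full $O(d)$ second-moment and momentum states, which is the source of the memory savings described in the abstract.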
Lay Summary: Optimizers for training deep neural networks, such as Adam, maintain internal states that consume a substantial amount of memory for large models. We introduce novel algorithms that not only reduce memory consumption but also achieve faster convergence than Adam and other recent memory-efficient optimizers such as GaLore, both experimentally and theoretically.
Link To Code: https://github.com/timmytonga/sn-sm
Primary Area: Optimization
Keywords: Memory-efficient optimization, adaptive optimization, parameter efficient fine-tuning, PEFT, LLMs, high-probability convergence, stochastic optimization
Submission Number: 3209