Keywords: Stochastic Optimization, Memory Efficiency, Non-convex Optimization, Adaptive Stochastic Methods, Large Language Models, LLMs, Pre-training
TL;DR: A more memory-efficient adaptive stochastic optimization method with faster convergence and high-probability convergence guarantees for non-convex objectives.
Abstract: As deep neural networks grow larger, memory efficiency becomes crucial, with the optimizer states of popular algorithms like Adam consuming substantial memory. This paper generalizes existing high-probability convergence analyses for AdaGrad and AdaGrad-Norm to arbitrary parameter partitions, encompassing both algorithms as special cases. We reveal a trade-off between coordinate-noise density and the convergence rate's dimensional dependency, suggesting an optimal grouping between the full coordinate version (AdaGrad) and the scalar version (AdaGrad-Norm). This insight leads to a principled compression approach called \textit{Subset-Norm}, targeting the coordinate-wise second-moment term in AdaGrad, RMSProp, and Adam. We demonstrate the empirical effectiveness of subset-norm step sizes on LLM pre-training tasks with LLaMA models, showing performance competitive with baselines like Adam while reducing the optimizer-state memory from $O(d)$ to $O(\sqrt{d})$ and introducing no additional hyperparameters.
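To make the Subset-Norm idea concrete, below is a minimal sketch of an AdaGrad-style update that keeps one second-moment accumulator per subset of coordinates rather than per coordinate. The partition into roughly $\sqrt{d}$ contiguous groups, the function name, and the default hyperparameters are illustrative assumptions based on the abstract, not the paper's exact algorithm.

```python
# Hypothetical sketch of a Subset-Norm adaptive step (reconstructed from the
# abstract; the paper's actual partitioning and update rule may differ).
# Coordinates of a d-dimensional parameter are split into ~sqrt(d) contiguous
# groups, and one squared-norm accumulator is kept per group, shrinking the
# optimizer state from O(d) to O(sqrt(d)).
import numpy as np

def subset_norm_adagrad_step(param, grad, state=None, lr=1e-2, eps=1e-8):
    d = param.size
    k = int(np.ceil(np.sqrt(d)))              # number of subsets (assumed choice)
    group_size = int(np.ceil(d / k))
    if state is None:
        state = np.zeros(k)                    # one accumulator per subset
    # Pad so the flattened gradient splits evenly into k groups.
    pad = k * group_size - d
    g = np.pad(grad.ravel(), (0, pad)).reshape(k, group_size)
    state += np.sum(g * g, axis=1)             # accumulate squared subset norms
    # Each coordinate is scaled by the accumulated norm of its subset.
    scale = np.repeat(np.sqrt(state) + eps, group_size)[:d].reshape(param.shape)
    return param - lr * grad / scale, state
```

Compared with coordinate-wise AdaGrad, which stores a $d$-dimensional accumulator, this sketch stores only $k \approx \sqrt{d}$ values, matching the memory reduction described in the abstract.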
Submission Number: 92