Keywords: maximal update parameterization, hyperparameter transfer, scalable training, adaptive optimizers, scaling laws, spectral learning, scalable optimization for ML, optimization for deep learning
TL;DR: Derivation and implementation of $\mu$P scaling laws for a general class of optimizers (AdamW, ADOPT, LAMB, Sophia).
Abstract: Tuning hyperparameters (HPs) for large language models (LLMs) is computationally expensive. Maximal update parameterization ($\mu$P) offers width-independent scaling rules that allow HPs tuned on small models to transfer to larger ones, but prior derivations for SGD and Adam rely on tensor programs, which are difficult to extend. Building on recent work that introduced spectral conditions as an alternative to tensor programs, we propose a framework to derive $\mu$P for a broader class of optimizers, including AdamW, ADOPT, LAMB, and Sophia. We validate our derivations on NanoGPT and further provide empirical insights into depth-scaling parameterization for these optimizers.
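To make the width-scaling idea concrete, below is a minimal sketch (not the paper's implementation) of the commonly cited $\mu$P rule for Adam-type optimizers: learning rates of matrix-like hidden weights shrink as 1/width relative to a tuned base width, while vector-like parameters keep the base learning rate. The helper name `make_param_groups` and the `base_width` convention are illustrative assumptions.

```python
# Hypothetical sketch of mu-P-style learning-rate scaling for AdamW,
# assuming the common rule: hidden (matrix-like) weights get lr * base_width / width,
# while biases, norms, and embeddings keep the base lr.
import torch
import torch.nn as nn


def make_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Split parameters into mu-P-style groups with width-scaled learning rates."""
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Treat 2-D weights of non-embedding layers as "matrix-like" parameters.
        if p.ndim == 2 and "embed" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr * base_width / width},  # scaled with width
        {"params": other, "lr": base_lr},                        # unscaled
    ]


width = 1024
model = nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width))
opt = torch.optim.AdamW(make_param_groups(model, base_lr=3e-4, base_width=256, width=width))
```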
Submission Number: 123