Keywords: second order optimization; scaling laws; maximum update parameterization; batch size scaling; depth scaling; critical batch size; compute optimal scaling
TL;DR: We investigate how to scale second-order optimizers effectively, showing they outperform AdamW and reduce data needs in compute-optimal transformer training.
Abstract: Several recently introduced deep learning optimizers inspired by second-order methods have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results, with some finding a quickly diminishing advantage over AdamW with scale. In this work, we investigate \emph{how to scale} second-order optimizers to achieve optimal performance at scale. Through theoretical and empirical analysis, we derive scaling rules for hyperparameters such as learning rate and weight decay as we scale up model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. For compute-optimal scaling, we find that scaling the independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers, and that second-order optimizers have a substantially larger optimal model size than AdamW for a fixed compute budget. Applying these scaling rules, we show that Muon achieves a speedup of close to $1.4\times$ or higher over AdamW in training transformer language models, while incorrect scaling can decrease the speedup from $1.4\times$ to below $1.1\times$ as models grow from $190$M to $640$M parameters.
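To make the stated weight-decay rule concrete, here is a minimal Python sketch of scaling the independent (decoupled) weight decay as $1/\mathrm{width}$ when widening a model. The reference values `base_width` and `base_weight_decay` are hypothetical placeholders for hyperparameters tuned on a small proxy model; only the $1/\mathrm{width}$ proportionality is taken from the abstract, and this is not the paper's implementation.

```python
def scaled_weight_decay(width: int,
                        base_width: int = 256,
                        base_weight_decay: float = 1e-4) -> float:
    """Scale the independent weight decay as 1/width relative to a base model.

    Assumes `base_weight_decay` was tuned at `base_width` (hypothetical values);
    the returned value follows the 1/width rule stated in the abstract.
    """
    return base_weight_decay * (base_width / width)


if __name__ == "__main__":
    # Example: weight decay shrinks proportionally as width grows.
    for width in (256, 1024, 4096):
        print(f"width={width:5d}  weight_decay={scaled_weight_decay(width):.2e}")
```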
Supplementary Material: zip
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 24119