Keywords: scaling, maximal update parameterization, muP, feature learning, hyperparameter transfer, optimization
TL;DR: We propose a spectral condition for $\mu$P under joint width–depth scaling.
Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes.
While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width–depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories.
In this work, we develop a simple and unified spectral framework for $\mu$P under joint width–depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases.
Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations.
This approach not only recovers existing $\mu$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers.
Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $\mu$P condition preserves stable feature learning and enables robust HP transfer under width–depth scaling.
Our code is available at https://github.com/ML-GSAI/Width-Depth-muP.
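As a rough illustration of what a spectral condition on weights and per-step updates looks like in practice, the sketch below uses the width-only condition from earlier spectral feature-learning work, where a layer's weight matrix and its update both have spectral norm on the order of $\sqrt{n_{\text{out}}/n_{\text{in}}}$. This is an assumption made for illustration: the depth-dependent factors of the condition proposed here are not stated in the abstract and are not reproduced. All function names (`spectral_norm`, `width_only_target`, `init_weight`, `rescale_update`) are hypothetical and are not taken from the linked repository.

```python
import numpy as np


def spectral_norm(matrix):
    """Largest singular value of a 2-D array."""
    return float(np.linalg.norm(matrix, ord=2))


def width_only_target(fan_out, fan_in):
    """Assumed width-only target scale sqrt(fan_out / fan_in).

    Depth-dependent factors are deliberately omitted in this sketch.
    """
    return float(np.sqrt(fan_out / fan_in))


def init_weight(fan_out, fan_in, rng):
    """Gaussian init whose spectral norm is roughly the target scale.

    A Gaussian matrix with entry std sigma has spectral norm close to
    sigma * (sqrt(fan_in) + sqrt(fan_out)), so sigma is chosen accordingly.
    """
    sigma = width_only_target(fan_out, fan_in) / (np.sqrt(fan_in) + np.sqrt(fan_out))
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))


def rescale_update(raw_update, fan_out, fan_in, base_lr):
    """Rescale a raw update so its spectral norm equals base_lr * target.

    This is one hypothetical way to turn a spectral constraint on the
    update into a concrete step size; it is not a specific optimizer.
    """
    target = base_lr * width_only_target(fan_out, fan_in)
    return raw_update * (target / spectral_norm(raw_update))


rng = np.random.default_rng(0)
weight_ratios = {}
update_ratios = {}
for width in (128, 256, 512):
    weight = init_weight(width, width, rng)
    raw_update = rng.normal(size=(width, width))
    update = rescale_update(raw_update, width, width, base_lr=0.1)
    target = width_only_target(width, width)
    # Ratios that stay roughly constant across widths indicate that the
    # assumed width-only condition is being met.
    weight_ratios[width] = spectral_norm(weight) / target
    update_ratios[width] = spectral_norm(update) / (0.1 * target)
    print(width, round(weight_ratios[width], 3), round(update_ratios[width], 3))
```

If both ratios stay near a constant as the width grows, the weights and updates follow the assumed width-only scaling; a joint width–depth version would change what `width_only_target` returns rather than the structure of the check.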
Submission Number: 63