Keywords: maximal update parametrization, llm, pretraining, hyperparameter transfer, learning dynamics, adamw, mup, weight decay, hyperparameter tuning, scaling law, transformer
TL;DR: An empirically focused study showing that µP requires weight decay to successfully transfer learning rates across model sizes in practice.
Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (µP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across model widths. However, the scaling rules of µP rely on strong assumptions, particularly about the geometric alignment of a layer’s inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training, it is weight decay, rather than µP, that correctly stabilizes the update dynamics of internal representations across widths, thereby facilitating learning rate transfer. This suggests that µP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together, these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain common empirical practice, such as why µP requires the independent weight decay variant for successful transfer.
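To make the two ingredients discussed in the abstract concrete, below is a minimal sketch (not the authors' code) of µP-style per-width learning rate scaling for AdamW together with the "independent" weight decay variant, emulated by rescaling the per-group decay coefficient. Names such as `base_width`, `base_lr`, and `lambda_wd` are illustrative assumptions, and the sketch simplifies µP by treating all weight matrices uniformly rather than giving embedding and output layers their special scaling.

```python
# Minimal sketch, assuming PyTorch: muP-style learning-rate scaling across widths
# plus width/LR-independent weight decay via AdamW parameter groups.
import torch
import torch.nn as nn


def mup_adamw(model: nn.Module, width: int, base_width: int = 256,
              base_lr: float = 1e-3, lambda_wd: float = 1e-4) -> torch.optim.AdamW:
    """Build AdamW with muP-scaled learning rates and independent weight decay."""
    mult = base_width / width  # muP with Adam: hidden-matrix LR shrinks as 1/width
    groups = []
    for name, p in model.named_parameters():
        if p.ndim >= 2:   # weight matrices: scale the learning rate with width
            lr = base_lr * mult
        else:             # biases / norm gains: keep the base learning rate
            lr = base_lr
        # torch.optim.AdamW applies decay as lr * weight_decay * p (coupled to lr);
        # dividing by lr makes the effective decay equal to lambda_wd for every
        # group, i.e. the "independent" weight decay variant.
        groups.append({"params": [p], "lr": lr, "weight_decay": lambda_wd / lr})
    return torch.optim.AdamW(groups)


# Usage (hypothetical model): optimizer = mup_adamw(my_transformer, width=1024)
```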
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19879