Keywords: mup, weight decay, learning rate transfer, optimization, adamw, transformer, llm
TL;DR: µP relies on weight decay for successful learning rate transfer across model widths in practice
Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (µP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rule of µP relies on strong assumptions, particularly about the alignment between the weights and updates of a layer and its inputs. We empirically show that in the practical setups where learning rate transfer is most valuable, such as LLM training, these assumptions hold only briefly at the start of training. For the remainder of training, it is weight decay rather than µP that stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. Instead, the learning rate scaling of µP acts as a form of learning rate warmup and can sometimes be replaced by one. Overall, this work fundamentally challenges prevailing beliefs about learning rate transfer and explains why µP requires the independent weight decay variant for successful transfer.
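The following is a minimal sketch, not the paper's implementation, of the two ingredients the abstract contrasts: µP's width-dependent learning rate for hidden (matrix-like) layers and the independent (lr-decoupled) weight decay variant of AdamW versus the common lr-coupled one. The function names `adamw_step` and `mup_lr`, the choice of `base_width=256`, and the beta values are illustrative assumptions.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr, wd, betas=(0.9, 0.95), eps=1e-8,
               independent_wd=True):
    """One AdamW step on a single weight matrix.

    t is the 1-indexed step count. With independent_wd=True the weights
    are decayed by wd directly (decoupled from lr), the variant the
    abstract says muP needs for learning rate transfer; with
    independent_wd=False the decay is multiplied by lr, as in the
    common AdamW formulation.
    """
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - (wd if independent_wd else lr * wd) * w
    return w, m, v

def mup_lr(base_lr, width, base_width=256, hidden=True):
    """muP-style per-layer learning rate: hidden (matrix-like) layers are
    scaled by base_width / width; embedding/readout layers keep base_lr."""
    return base_lr * (base_width / width) if hidden else base_lr
```

Note that with the lr-coupled variant the effective decay on hidden layers shrinks together with the µP-scaled learning rate as width grows, whereas the independent variant applies the same decay at every width; this is one way to see, in code, why the two variants behave differently under µP scaling.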
Submission Number: 41