On the Provable Separation of Scales in Maximal Update Parameterization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We establish the first fundamental separation of scales in µP between macro-variables (e.g., loss landscapes) and micro-variables (e.g., individual weights).
Abstract: Maximal Update Parameterization ($\mu$P) has shown significant promise in allowing zero-shot hyperparameter transfer across neural network scales, reducing the prohibitive cost of hyperparameter tuning for large models. However, the theoretical foundation behind the observed approximate transferability of hyperparameters remains underexplored. Relying on a width-dominance regime, which ensures that, as width grows, certain terms of the learning dynamics dominate, we establish the first fundamental separation of scales in $\mu$P between macro-variables (e.g., loss landscapes) and micro-variables (e.g., individual weights). Our formulation explains why hyperparameter tuning can be performed effectively in early training stages, i.e., \textit{early statistics effectively approximate global hyperparameter optima}, implying the potential to further reduce the training cost of searching for optimal hyperparameters. We further apply our main theory to explain an empirical deep learning phenomenon discovered independently by prior work.
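To make the transfer recipe the abstract refers to concrete, below is a minimal sketch (not from the paper) of a µP-style workflow: the base learning rate is tuned once at a small width and reused unchanged at larger widths, with per-layer scaling absorbing the width dependence. The specific 1/width rules used here are a commonly cited simplification for Adam-style updates and are an assumption for illustration only.

```python
# Minimal, self-contained sketch of muP-style learning-rate scaling for
# zero-shot hyperparameter transfer. Not the paper's code; the per-layer
# rules below are a simplified version of the full muP table (Adam-style).
import numpy as np

def mup_learning_rates(base_lr, width, base_width=64):
    """Per-layer learning rates under a simplified muP rule (Adam-style)."""
    ratio = base_width / width
    return {
        "input": base_lr,           # input layer: LR kept width-independent
        "hidden": base_lr * ratio,  # hidden layers: LR ~ 1/width
        "output": base_lr * ratio,  # output layer: LR ~ 1/width
    }

def init_mlp(width, d_in=10, d_out=1, rng=None):
    """Initialize a 3-layer MLP with muP-style output initialization."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return {
        "input": rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(width, d_in)),
        "hidden": rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width)),
        # output layer initialized ~ 1/width so logits stay O(1) as width grows
        "output": rng.normal(0.0, 1.0 / width, size=(d_out, width)),
    }

# The base learning rate tuned at base_width=64 is reused verbatim at larger
# widths; only the per-layer scaling changes with width.
for width in (64, 256, 1024):
    lrs = mup_learning_rates(base_lr=1e-2, width=width)
    print(width, {name: round(lr, 6) for name, lr in lrs.items()})
```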
Lay Summary: We prove that very wide neural networks learn on two distinct time-scales: overall performance indicators stabilise almost immediately, while individual weights change much more slowly. Because the "big picture" settles so fast, hyperparameters chosen on a small or early-stage model remain valid when the model is scaled up, explaining the success of µP's zero-shot transfer. This two-speed view also predicts the real-world lag you see between a change in the learning rate and the loss curve's response.
Primary Area: Deep Learning->Theory
Keywords: Maximal Update Parametrization, hyperparameter tuning, zero-shot transfer, stochastic differential equations
Submission Number: 9761