Keywords: Hyperparameter transfer
Abstract: A central characteristic of Maximal Update Parameterization ($\mu$P) is \textit{hyperparameter transfer}\textemdash the optimal hyperparameters (e.g., learning rate) found on small models continue to be optimal at large scales.
This allows practitioners to tune hyperparameters cheaply on small models and reuse them at scale, avoiding the prohibitive cost of direct tuning on large models.
In its original formulation, $\mu$P was derived under a set of analytical assumptions to ensure $\Theta(1)$ feature updates for a finite number of training steps in the large width limit $n \rightarrow \infty$.
Although $\Theta(1)$ feature updates do not formally imply that the optimal learning rate transfers across widths, such transfer has been widely observed in practice.
In this work, we identify a regime in which the optimal learning rate fails to transfer as the model is scaled.
We find that the optimal learning rate for test performance changes sharply across the double descent transition, yet remains fairly consistent within each of the under-parameterized and over-parameterized regimes.
We further show that weight decay and data augmentation can each improve the reliability of learning rate transfer, albeit through different mechanisms.
Our findings clarify the practical boundaries of hyperparameter transfer and highlight regimes where optimal learning rates are unlikely to transfer reliably.
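As a rough illustration of what "transfer" means operationally, the sketch below takes a learning rate sweep at each width, picks the learning rate with the lowest test loss, and reports how far the optimum moves across widths. The function names (`optimal_lr`, `transfer_gap`) and the numbers in `sweeps` are hypothetical placeholders, not results or code from this work; the gap measure (a log2 ratio) is one possible choice among many.

```python
import math

def optimal_lr(sweep):
    """Return the learning rate with the lowest test loss in one sweep."""
    return min(sweep, key=sweep.get)

def transfer_gap(sweeps):
    """Largest log2 ratio between the optimal learning rates across widths.

    A gap of 0 means every width shares the same optimum (perfect transfer).
    """
    best = [optimal_lr(s) for s in sweeps.values()]
    return math.log2(max(best) / min(best))

# Hypothetical placeholder values: width -> {learning rate: test loss}.
sweeps = {
    64: {0.001: 0.9, 0.01: 0.5, 0.1: 0.7},
    256: {0.001: 0.8, 0.01: 0.4, 0.1: 0.6},
    1024: {0.001: 0.3, 0.01: 0.35, 0.1: 0.5},
}

for width, sweep in sweeps.items():
    print(width, optimal_lr(sweep))
print("transfer gap (log2):", transfer_gap(sweeps))
```

In this toy input the optimum is shared by the two smaller widths and shifts at the largest one, which is the kind of break in transfer the abstract describes; splitting `sweeps` into groups on either side of a transition and computing the gap within each group would be one way to check consistency inside a regime.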
Submission Number: 127