A thorough reproduction and evaluation of $\mu$P

Georgios Vlassis; David Belius; Volodymyr Fomichov

A thorough reproduction and evaluation of $\mu$P

Georgios Vlassis, David Belius, Volodymyr Fomichov

Published: 23 Jan 2025, Last Modified: 23 Jan 2025Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: This paper is an independent empirical reproduction of the claimed benefits of the $\mu$P parametrization proposed in \citet{yang2020feature} and \citet{yang2021tuning}. Under the so-called Standard Parametrization (SP), the weights of neural networks are initialized from the Gaussian distribution with variance scaling as the inverse of ``fan-in'', with the learning rate being the same for every layer. While this guarantees that (pre)activations are $\mathcal{O}(1)$ at initialization with respect to width, it causes their scale to be width-dependent during training. To address this, \citet{yang2020feature} and \citet{yang2021tuning} proposed the Maximal Update Parametrization ($\mu$P), which is also claimed to make the optimal value of various hyperparameters independent of width. However, despite its alleged benefits, $\mu$P has not gained much traction among practitioners. Possibly, this could stem from a lack of thorough independent evaluation of $\mu$P against SP. We address this by independently reproducing the empirical claims of the original works. At the same time, we substantially increase the scale of the experiments, by training $16000$ neural networks of sizes from $500$ to $1$B parameters, and empirically investigate $\mu$P's effect on outputs, gradient updates, weights, training loss and validation loss. We find that generally $\mu$P indeed delivers on its promises, even though this does not always translate to improved generalization.

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: 1) Added the confidence intervals in Table 2, adjusted even more the wording in result interpretation and added a comment on the caption of the table about the results' variance 2) Improved the theoretical discussions (second paragraph of Sec. 1.1, second paragraph of Sec. 4.1.1, Sec. 4.4, Sec. 5) 3) Improved wording 4) Deanonymized the document

Code: https://github.com/gvlassis/ant

Supplementary Material: zip

Assigned Action Editor: ~Anastasios_Kyrillidis2

Submission Number: 3636

Loading