Abstract: This paper is an independent empirical reproduction of the claimed benefits of the $\mu$P parametrization proposed in \citet{yang2020feature} and \citet{yang2021tuning}. Under the so-called Standard Parametrization (SP), the weights of neural networks are initialized from a Gaussian distribution with variance scaling as the inverse of the ``fan-in'', while the learning rate is the same for every layer. While this guarantees that (pre)activations are $\mathcal{O}(1)$ with respect to width at initialization, their scale becomes width-dependent during training. To address this, \citet{yang2020feature} and \citet{yang2021tuning} proposed the Maximal Update Parametrization ($\mu$P), which is also claimed to make the optimal values of various hyperparameters independent of width. Despite these alleged benefits, however, $\mu$P has not gained much traction among practitioners, possibly because of a lack of thorough independent evaluation of $\mu$P against SP. We address this by independently reproducing the empirical claims of the original works. At the same time, we substantially increase the scale of the experiments, training more than $10{,}000$ neural networks ranging from $500$ to $0.5$B parameters, and empirically investigate $\mu$P's effect on outputs, gradient updates, weights, training loss, and validation loss. We find that $\mu$P generally delivers on its promises, even though this does not always translate into improved generalization.
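For readers unfamiliar with the setup, the following is a minimal illustrative sketch (not the authors' code) of the Standard Parametrization described in the abstract: Gaussian weight initialization with variance equal to the inverse of the layer's fan-in, and a single width-independent learning rate shared by all layers. The network shape, width, and learning rate below are arbitrary placeholders.

```python
# Sketch of SP as described in the abstract (PyTorch); hypothetical sizes.
import math
import torch
import torch.nn as nn

def sp_init_(linear: nn.Linear) -> None:
    """SP initialization: Var[W_ij] = 1 / fan_in, zero bias."""
    fan_in = linear.in_features
    nn.init.normal_(linear.weight, mean=0.0, std=1.0 / math.sqrt(fan_in))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

width = 1024  # hypothetical hidden width
model = nn.Sequential(
    nn.Linear(32, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 10),
)
for m in model.modules():
    if isinstance(m, nn.Linear):
        sp_init_(m)

# SP: one global learning rate for every layer. muP (yang2021tuning) instead
# rescales per-layer initialization and learning rates with width, which is
# what is claimed to make optimal hyperparameters transfer across widths.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```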
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: 1) Minor corrections
2) Added computational requirements
Assigned Action Editor: ~Anastasios_Kyrillidis2
Submission Number: 3636