Keywords: maximal update parametrization, learning dynamics, hyperparameter transfer, efficiency, training, stability, scaling, numerics, fp8, low precision
TL;DR: We improve µP by combining it with Unit Scaling, leading to a simpler scheme with better default hyperparameters, lower loss, more efficient sweeping and simple FP8 training.
Abstract: The recent Maximal Update Parametrization (µP) enables the hyperparameters for small models to transfer directly to large ones, substantially reducing the cost of training by avoiding expensive sweeps at scale. We present a new scheme, u-µP, which improves upon µP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: µP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that these activations (along with weights and gradients) start training at a scale of one. This synthesis opens the door to a simpler scheme, whose default hyperparameter values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-µP models reaching a lower loss than comparable µP models and working out-of-the-box in FP8.
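As a rough illustration of the "starting scale of one" property described above, the following PyTorch sketch (not the authors' implementation; the class name and scaling choice are assumptions for illustration) draws weights from a unit-variance distribution and applies a fixed 1/sqrt(fan_in) factor in the forward pass, so that weights and output activations both begin at scale close to one. It only covers the forward pass; the full method also keeps gradient scales near one.

```python
# Minimal sketch of a unit-scaled linear layer, assuming unit-scale inputs.
import torch
import torch.nn as nn


class UnitScaledLinear(nn.Module):
    def __init__(self, fan_in: int, fan_out: int) -> None:
        super().__init__()
        # Unit-variance initialisation: the weight tensor itself has scale ~1.
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))
        # Fixed forward scale keeps the output variance ~1 regardless of width.
        self.scale = fan_in ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.weight) * self.scale


x = torch.randn(128, 512)               # unit-scale input activations
y = UnitScaledLinear(512, 512)(x)
print(x.std().item(), y.std().item())   # both close to 1.0
```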
Student Paper: No
Submission Number: 47