u-μP: The Unit-Scaled Maximal Update Parametrization

Published: 16 Jun 2024 · HiLD at ICML 2024 Poster · CC BY 4.0
Keywords: maximal update parametrization, learning dynamics, hyperparameter transfer, efficiency, training, stability, scaling, numerics, fp8, low precision
TL;DR: We improve µP by combining it with Unit Scaling, leading to a simpler scheme with better default hyperparameters, lower loss, more efficient sweeping and simple FP8 training.
Abstract: The recent Maximal Update Parametrization (µP) enables the hyperparameters for small models to transfer directly to large ones, substantially reducing the cost of training by avoiding expensive sweeps at scale. We present a new scheme, u-µP, which improves upon µP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: µP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that the starting scale of these activations is one (along with weights and gradients). This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-µP models reaching a lower loss than comparable µP models and working out-of-the-box in FP8.
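To make the Unit Scaling idea referenced above concrete, here is a minimal, illustrative sketch (not the authors' implementation) of a unit-scaled linear layer in PyTorch: weights are initialized with unit variance and the 1/sqrt(fan_in) factor is applied inside the op, so that unit-variance inputs produce unit-variance activations at initialization. The class and variable names are hypothetical.

```python
# Hypothetical sketch of the Unit Scaling idea, not the paper's code.
import torch


class UnitScaledLinear(torch.nn.Module):
    def __init__(self, fan_in: int, fan_out: int) -> None:
        super().__init__()
        # Unit-variance initialization: no 1/sqrt(fan_in) baked into the weights.
        self.weight = torch.nn.Parameter(torch.randn(fan_out, fan_in))
        self.fan_in = fan_in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The scale lives in the op rather than in the weights, keeping weights
        # and activations near unit scale at initialization. (The full method
        # also applies a separate scale in the backward pass for gradients.)
        return (x @ self.weight.t()) * self.fan_in ** -0.5


x = torch.randn(1024, 512)            # unit-variance input
y = UnitScaledLinear(512, 256)(x)
print(y.std())                        # ~1.0 at initialization
```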
Student Paper: No
Submission Number: 47