Keywords: norm-based optimisation, matrix norm dynamics, optimal scaling, hyperparameter transfer
Abstract: Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^\ast, B^\ast)$ consistently has the same operator norm value — a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^\ast, B^\ast)$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^\ast, B^\ast)$ scaling with dataset size, finding it consistent with theoretical predictions. We share practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 17
Loading