Transfer Paramatters: Optimal per-Module Hyperparameters Across All Scaling Axes

ICLR 2026 Conference Submission 14416 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: muP, tensor programs, hyperparameter optimization, hyperparameter transfer
TL;DR: A careful study of hyperparameter transfer (e.g. LR, batch size) when hyperparameters are tuned first on a small model and then transferred to larger scales
Abstract: Hyperparameter tuning can dramatically impact the training stability of large-scale models. Recent works on neural network parameterisations, such as μP, have shown that layer type and size should dictate how global hyperparameters are rescaled in order to achieve efficient transfer across model sizes. On the other hand, the established practice in hyperparameter optimisation is to search for optimal global base values at some fixed model scale. We transfer hyperparameters across all scaling axes: width and depth (using an extension of CompleteP; Dey et al., 2025), training horizon, and batch size. Our study covers all optimisation hyperparameters of modern models: learning rates, Adam parameters, weight decay, initialisation scales, and residual block multipliers. Lastly, we demonstrate that hyperparameter transfer holds even in the per-layer hyperparameter regime. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We suggest a simplified parameterisation of the hyperparameter space that reduces the dimensionality of the search space at no performance cost. Our experiments demonstrate training speed improvements when applying transferred hyperparameters to large language models.
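To illustrate the kind of per-layer rescaling rule the abstract refers to, here is a minimal μP-style sketch in PyTorch; it is not the authors' exact parameterisation. The assumptions are mine: a base learning rate tuned at a small proxy width, hidden matrix-like weights whose learning rate scales inversely with the width multiplier, and embedding/output layers kept at the base rate for brevity.

```python
# Minimal, illustrative muP-style rescaling of a per-layer learning rate.
# Assumptions (not from the paper): base_lr was tuned at base_width; hidden
# 2D weights rescale their LR by 1/width_mult, everything else keeps base_lr.
import torch
import torch.nn as nn


def mup_param_groups(model, base_lr, base_width, width):
    """Build Adam parameter groups with width-dependent learning rates."""
    width_mult = width / base_width
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Treat 2D weights outside embedding/output layers as "matrix-like"
        # parameters whose LR shrinks with width; biases, norms, embeddings,
        # and the output head keep the base learning rate in this sketch.
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr / width_mult},
        {"params": other, "lr": base_lr},
    ]


# Toy usage: hyperparameters tuned at width 256, transferred to width 1024.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
groups = mup_param_groups(model, base_lr=1e-2, base_width=256, width=1024)
opt = torch.optim.Adam(groups, betas=(0.9, 0.95), weight_decay=0.0)
```

The paper additionally transfers Adam parameters, weight decay, initialisation scales, and residual block multipliers across depth, training horizon, and batch size; the sketch above only shows the width axis for a single hyperparameter.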
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 14416