TL;DR: We provide a way to measure the step sizes of a neural network optimiser in function space rather than the usual parameter space, and use this to reuse the optimal learning rates of small models in much wider and deeper models.
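As a rough illustration of the quantity being measured (a naive finite-difference sketch, not the paper's efficient estimator, which uses a few extra backward passes), the snippet below applies one parameter tensor's optimiser update in isolation and records the RMS change it causes in the network's outputs on a fixed batch. The toy model, loss, and all names here are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and a fixed batch of probe inputs (both illustrative).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(64, 16)

# Take one optimiser step on a copy, so old and new parameters coexist.
stepped = copy.deepcopy(model)
opt = torch.optim.SGD(stepped.parameters(), lr=1e-2)
stepped(x).square().mean().backward()  # placeholder loss
opt.step()

with torch.no_grad():
    f_before = model(x)
    for (name, p_old), p_new in zip(model.named_parameters(),
                                    stepped.parameters()):
        delta = p_new - p_old  # this tensor's parameter-space update
        p_old.add_(delta)      # apply only this tensor's update
        f_after = model(x)
        p_old.sub_(delta)      # restore the original parameters
        # RMS change in the output function caused by this one update:
        # a crude stand-in for the layer's function-space step size.
        fs_lr = (f_after - f_before).norm() / f_before.numel() ** 0.5
        print(f"{name}: function-space step size ~ {fs_lr:.3e}")
```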
Abstract: We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, with minimal computational overhead: a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts the parameter-space layerwise learning rates when training larger models so that function-space updates remain consistent. FLeRM enables hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in a range of architectures, including MLPs with residual connections and transformers with different layer normalisation schemes.
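The FLeRM procedure itself is specified in the paper; the following is only a hedged sketch of the core idea, under the assumption that a layer's function-space step size scales roughly linearly with its parameter-space learning rate: rescale each layer's learning rate by the ratio of the base model's recorded function-space learning rate to the one measured in the larger model. The function `flerm_rescale`, the `"name"`-tagged parameter groups, and the example values are hypothetical, not taken from the paper's codebase.

```python
from typing import Dict
import torch

def flerm_rescale(optimizer: torch.optim.Optimizer,
                  base_fs_lr: Dict[str, float],
                  measured_fs_lr: Dict[str, float],
                  eps: float = 1e-12) -> None:
    """Hypothetical sketch: scale each layer's parameter-space learning
    rate so its function-space step size matches the one recorded while
    training the small base model. Assumes one param group per layer,
    each tagged with a 'name' key, and that function-space step size is
    roughly linear in the learning rate."""
    for group in optimizer.param_groups:
        name = group["name"]
        scale = base_fs_lr[name] / (measured_fs_lr[name] + eps)
        group["lr"] *= scale

# Usage example: per-layer param groups tagged by name (illustrative).
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Linear(32, 4))
groups = [{"params": p, "name": n, "lr": 1e-3}
          for n, p in model.named_parameters()]
opt = torch.optim.Adam(groups)
flerm_rescale(opt,
              base_fs_lr={n: 1e-2 for n, _ in model.named_parameters()},
              measured_fs_lr={n: 5e-3 for n, _ in model.named_parameters()})
```

Extra keys such as `"name"` are preserved in PyTorch optimiser param groups, which is what makes this per-layer bookkeeping convenient.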
Lay Summary: We propose a way to estimate how changes to individual parts of an AI system affect its behaviour as a whole. This will help AI researchers better understand why current AI systems work well and how to improve them in the future. For example, in our paper we use our method to tune very large AI systems by tuning a small, cheap model and then copying the settings over to the large model (which would be very computationally expensive to tune directly). This is not usually possible, because the optimal settings change between small and large AI systems, but with our method we can predict, and therefore correct for, these changes.
Link To Code: https://github.com/edwardmilsom/function-space-learning-rates-paper
Primary Area: Deep Learning->Everything Else
Keywords: deep learning, optimisation, function space, hyperparameter transfer, muP, learning rate, Kronecker
Submission Number: 10462