TL;DR: We prove and validate that across non-recurrent multi-path architectures, the maximal stable learning rate decays with effective depth by a -3/2 power law, enabling zero-shot learning-rate transfer across depths.
Abstract: Deeper modern architectures are costly to tune, and the base learning rate is often one of the most sensitive hyperparameters. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depthwise learning-rate scaling is less understood for modern architectures with convolution, residual aggregation, and attention. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce an architecture-dependent notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we derive a shared leading-order -3/2 law for the base learning-rate scale as effective depth grows. Here, the budget controls typical one-step representation-update energy at initialization, and effective depth counts sequential update-bearing units while absorbing fixed local structure into constants. Experiments across diverse architectures confirm the predicted slope and enable reliable zero-shot transfer of learning rates across depths and widths, turning depth scaling into a predictable hyperparameter-transfer problem.
Lay Summary: Training a deep neural network often requires many trial-and-error runs to find a good learning rate, the setting that controls how large each training step is. This becomes especially expensive when researchers make a model deeper, because a setting that works for a shallow model may not work for a deeper one. This paper studies how the learning rate should change as model depth increases. We find a simple depth-based rule that works across several common neural network designs, including models used in image recognition and modern AI systems. The rule lets researchers tune the learning rate at one depth and transfer it to another depth with much less search. Experiments show that this rule captures the main trend across different model types and datasets. The result can reduce wasted training runs, lower tuning cost, and make depth-scaling experiments more accessible.
Originally Submitted Supplementary Material: zip
Primary Area: Optimization
Keywords: learning rate scaling, depth scaling, muP, maximal update, ResNets, Transformers
Originally Submitted PDF: pdf
Submission Number: 34465
Loading