Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Published: 07 Nov 2023, Last Modified: 13 Dec 2023, M3L 2023 Oral
Keywords: infinite depth and width, residual networks, muP, deep learning theory
Abstract: We study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures, including convolutional ResNets and Vision Transformers, trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature-learning joint infinite-width and infinite-depth limit, and we demonstrate convergence of finite-size network dynamics towards this limit.
Submission Number: 32
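
To illustrate the $1/\sqrt{\text{depth}}$ branch scale mentioned in the abstract, the following is a minimal NumPy sketch of a forward pass at initialization only. It is not the authors' implementation. It assumes a toy fully connected residual block with a tanh nonlinearity, i.i.d. standard normal weights, and a $1/\sqrt{\text{width}}$ multiplier on each hidden layer. The names `residual_forward` and `rms` are hypothetical helpers introduced here for illustration.

```python
import numpy as np


def residual_forward(x, depth, branch_scale, seed=0):
    """Toy residual stack: h <- h + branch_scale * W @ tanh(h) / sqrt(width).

    Assumption (not from the source): each W has i.i.d. N(0, 1) entries and is
    freshly sampled per layer, so this models a network at initialization only.
    """
    rng = np.random.default_rng(seed)
    width = x.shape[0]
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((width, width))
        h = h + branch_scale * (W @ np.tanh(h)) / np.sqrt(width)
    return h


def rms(h):
    """Root-mean-square size of the entries of h."""
    return float(np.sqrt(np.mean(h ** 2)))


width = 256
x = np.random.default_rng(1).standard_normal(width)

# Branch scale 1/sqrt(depth): the hidden-state size stays of order one.
for depth in (16, 64, 256):
    h = residual_forward(x, depth, branch_scale=1.0 / np.sqrt(depth))
    print(f"depth={depth:4d}  scaled branch    rms={rms(h):.3f}")

# Branch scale 1: the hidden-state size grows with depth.
for depth in (16, 64):
    h = residual_forward(x, depth, branch_scale=1.0)
    print(f"depth={depth:4d}  unscaled branch  rms={rms(h):.3f}")
```

In this toy setting, each layer adds a contribution whose variance is at most `branch_scale ** 2` per coordinate, so the sum over `depth` layers stays bounded when `branch_scale` is `1 / sqrt(depth)` and grows roughly linearly in `depth` when `branch_scale` is 1. The sketch covers only the forward pass at initialization. It says nothing about the learning-rate scaling or training dynamics that the paper analyzes.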