Keywords: Scaling laws; Feature learning dynamics; Infinite-width and infinite-depth limit; Residual networks; Stochastic differential equations; Neural Tangent Kernel (NTK); Maximal update parameterization (μP).
TL;DR: We study why neural scaling laws succeed and fail by proving that deep ResNets trained with SGD converge to a coupled forward–backward SDE system in the joint infinite-width and infinite-depth limit.
Abstract: The empirical success of deep learning is often attributed to scaling laws that predict steady performance gains as model, data, and compute increase. However, large models often suffer severe training instability and diminishing returns, indicating that scaling laws only describe *what success looks like* but not *when and why scaling succeeds or fails*. A central barrier is the lack of a rigorous understanding of feature learning at *large depth*: while $\mu$P provides a principled characterization of feature learning dynamics in the infinite-width limit and enables hyperparameter (HP) transfer across width, its depth extension, i.e., depth-$\mu$P, faces critical challenges, especially in residual blocks with more than one internal layer. In this paper, we address this gap by deriving the **Neural Feature Dynamics (NFD)**, a coupled forward-backward stochastic system that rigorously characterizes the training dynamics of ResNets in the joint infinite-width and infinite-depth limit. NFD reveals when scaling laws hold, explains diminishing returns, and shows that the *gradient-independence assumption (GIA)*, known to fail during training at finite depth, becomes provably valid again at infinite depth, identifying a new regime where end-to-end feature learning remains *tractable* for analysis. Moreover, NFD uncovers a structural cause of the failure of depth-$\mu$P: representation learning collapses in the first layer of two-layer residual blocks. Motivated by this insight, we introduce a simple **depth-aware learning-rate correction** that restores depth-wise HP transfer and yields overall stronger performance.
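To make the depth-aware learning-rate idea concrete, here is a minimal, hypothetical sketch of how a per-layer step size might be rescaled with depth $L$ under depth-$\mu$P-style residual scaling. The abstract does not state the paper's exact correction rule, so the `sqrt(L)` factor applied to the first layer of each two-layer residual block is purely illustrative, chosen only to show the *shape* of such a correction (boosting the layer whose feature learning would otherwise collapse).

```python
import math

def depth_mup_branch_scale(num_blocks: int) -> float:
    # Depth-muP-style scaling: each residual branch is multiplied by
    # 1/sqrt(L) so activations stay O(1) as the number of blocks L grows.
    return 1.0 / math.sqrt(num_blocks)

def depth_aware_lr(base_lr: float, num_blocks: int, layer_in_block: int) -> float:
    """Hypothetical depth-aware learning-rate correction.

    The paper reports that, under depth-muP, representation learning
    collapses in the *first* layer of two-layer residual blocks. A
    depth-aware correction rescales that layer's learning rate with
    depth; the sqrt(L) exponent below is an illustrative assumption,
    not the paper's actual rule.
    """
    if layer_in_block == 0:   # first layer of the two-layer block
        return base_lr * math.sqrt(num_blocks)
    return base_lr            # second layer: base rate unchanged

# Example: per-layer rates for a 64-block ResNet at base LR 1e-3.
lrs = [depth_aware_lr(1e-3, num_blocks=64, layer_in_block=i) for i in (0, 1)]
```

In a framework such as PyTorch, a rule like this would typically be realized via per-parameter-group learning rates in the optimizer, with one group per residual sublayer.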
Primary Area: learning theory
Submission Number: 1938