Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 oralEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Overfitting; feature learning; dynamical mean field theory; generalization;
TL;DR: Large neural networks first learn low dimensional feature representation then overfit the data and revert to a kernel regime.
Abstract: Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires to characterize the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well established technique of non-equilibrium statistical physics. We show that, for large network width $m$, and large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales which implies: $(i)$ The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$ Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$ A dynamical decoupling between feature learning and overfitting regimes; $(iv)$ A non-monotone behavior of the test error, associated `feature unlearning' regime at large times.
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 13847
Loading