Connecting NTK and NNGP: A Unified Theoretical Framework for Neural Network Learning Dynamics in the Kernel Regime
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Learning dynamics, Neural tangent kernel, Neural network Gaussian process, Infinite width limit, Representational drift, Statistical mechanics
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This study introduces a generalized Neural Dynamical Kernel (NDK) and unifies the NTK and NNGP theories, providing an exact analytical theory for the entire learning dynamics in infinitely wide neural networks.
Abstract: Artificial neural networks (ANNs) have revolutionized machine learning in recent years, but a complete theoretical framework for their learning process is still lacking. Substantial theoretical advances have been achieved for infinitely wide networks. In this regime, two disparate theoretical frameworks have been used to describe the network's output in terms of kernels: one is based on the Neural Tangent Kernel (NTK), which assumes linearized gradient descent dynamics, while the other is based on the Neural Network Gaussian Process (NNGP) kernel, which assumes Bayesian inference. However, the relation between these two frameworks, and between their underlying assumptions, has remained elusive. This work unifies the two theories by studying gradient descent learning dynamics with small additive noise in an ensemble of randomly initialized, infinitely wide deep networks. We derive an exact analytical expression for the network input-output function during and after learning, and introduce a new time-dependent Neural Dynamical Kernel (NDK) from which both the NTK and NNGP kernels can be derived. We identify two important learning phases characterized by different time scales: gradient-driven and diffusive learning. In the initial gradient-driven phase, the dynamics is dominated by deterministic gradient descent and is well described by NTK theory. This phase is followed by a slow diffusive phase, during which the network parameters sample the solution space and ultimately approach the equilibrium posterior distribution corresponding to the NNGP. Combining the theory with numerical evaluations on synthetic and benchmark datasets, we provide novel insights into the distinct roles of initialization, regularization, and network depth, as well as into phenomena such as early stopping and representational drift. This work closes the gap between the NTK and NNGP theories, providing a comprehensive framework for understanding the learning process of deep neural networks in the infinite-width limit.
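For context, a minimal sketch of the two standard kernels the abstract refers to, written in illustrative notation, is given below; these are background definitions from the existing NTK and NNGP literature, not the paper's Neural Dynamical Kernel itself.

% Background only: standard NTK and NNGP definitions in illustrative notation
% (f_\theta is the network output, \theta the parameters, (x_i, y_i) the training data).
%
% Neural Tangent Kernel: inner product of parameter gradients; under linearized
% gradient-flow training, the network outputs evolve according to this kernel.
\Theta(x, x') = \nabla_\theta f_\theta(x)^{\top} \nabla_\theta f_\theta(x'),
\qquad
\frac{\mathrm{d} f_t(x)}{\mathrm{d} t} = -\eta \sum_{i=1}^{P} \Theta(x, x_i)\,\bigl(f_t(x_i) - y_i\bigr).
%
% NNGP kernel: prior covariance of the output over random initializations; the
% Bayesian posterior mean is given by Gaussian-process regression with noise \sigma^2.
K(x, x') = \mathbb{E}_{\theta \sim \mathrm{prior}}\bigl[f_\theta(x)\, f_\theta(x')\bigr],
\qquad
\langle f(x_*) \rangle_{\mathrm{post}} = K(x_*, X)\,\bigl(K(X, X) + \sigma^2 I\bigr)^{-1} y.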
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6006