Adaptive kernel predictors from feature-learning infinite limits of neural networks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A theory of feature learning for Bayesian networks at infinite width.
Abstract: Previous influential work showed that infinite-width limits of neural networks in the lazy training regime are described by kernel machines. Here, we show that neural networks trained in the rich infinite-width regime, in two different settings, are also described by kernel machines, but with data-dependent kernels. In both cases, we provide explicit expressions for the kernel predictors and prescriptions for computing them numerically. To derive the first predictor, we study the large-width limit of feature-learning Bayesian networks, showing how feature learning leads to task-relevant adaptation of layer kernels and preactivation densities. The saddle-point equations governing this limit yield a min-max optimization problem that defines the kernel predictor. To derive the second predictor, we study gradient-flow training of randomly initialized networks with weight decay in the infinite-width limit using dynamical mean-field theory (DMFT). The fixed-point equations of the resulting DMFT define the task-adapted internal representations and the kernel predictor. We compare our kernel predictors to kernels derived in the lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets.
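As a rough illustration of the contrast drawn in the abstract between a fixed (lazy-regime) kernel and a data-dependent one, the sketch below fits kernel ridge regression with a per-feature-weighted kernel and adapts the weights through a simple multiple-kernel-learning-style alternating scheme. The update rule, function names, and hyperparameters are illustrative assumptions; this is not the min-max saddle-point predictor or the DMFT predictor derived in the paper (see the linked repository for the authors' code).

```python
# Minimal, hypothetical sketch of an "adaptive kernel" predictor: kernel ridge
# regression whose kernel is re-weighted from the data. The alternating update
# below is an MKL-style heuristic used purely for illustration.
import numpy as np

def weighted_kernel(X1, X2, w):
    """K(x, x') = sum_d w_d x_d x'_d with per-feature weights w >= 0."""
    return (X1 * w) @ X2.T

def fit_adaptive_kernel(X, y, ridge=1e-1, n_iters=50):
    """Alternate a kernel ridge solve (given the current kernel) with a
    closed-form re-weighting of the per-feature kernels toward the target."""
    n, d = X.shape
    w = np.ones(d) / d                                   # isotropic start: the fixed "lazy" kernel
    for _ in range(n_iters):
        K = weighted_kernel(X, X, w)
        alpha = np.linalg.solve(K + ridge * np.eye(n), y)  # dual coefficients
        gains = w * np.abs(X.T @ alpha)                    # per-feature relevance to the targets
        w = gains / (gains.sum() + 1e-12)                  # renormalized adapted weights
    return alpha, w

def predict(X_train, X_test, alpha, w):
    """Kernel regression prediction with the adapted (data-dependent) kernel."""
    return weighted_kernel(X_test, X_train, w) @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)   # only 2 relevant features
    X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]

    # Adapted kernel vs. the fixed isotropic kernel it starts from.
    alpha, w = fit_adaptive_kernel(X_tr, y_tr)
    w0 = np.ones(X.shape[1]) / X.shape[1]
    alpha0 = np.linalg.solve(weighted_kernel(X_tr, X_tr, w0) + 1e-1 * np.eye(len(y_tr)), y_tr)
    print("adaptive-kernel test MSE:", np.mean((predict(X_tr, X_te, alpha, w) - y_te) ** 2))
    print("fixed-kernel    test MSE:", np.mean((predict(X_tr, X_te, alpha0, w0) - y_te) ** 2))
```

In this toy setting the adapted weights concentrate on the two task-relevant input features, which is the qualitative behavior (task-dependent kernel adaptation) that the paper characterizes exactly in the infinite-width limit.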
Lay Summary: Despite neural networks' widespread adoption in modern machine learning, the solutions that deep networks converge to after training are not well understood. One promising theoretical approach is to analyze infinite-width neural networks or approximations near an infinite-width limit. While many prior works analyze lazy-learning limits of deep networks (where the internal features in hidden layers remain static during training), we instead consider Bayesian inference in networks whose large-width limits retain feature learning. The predictors in this limit are given by a kernel regression solution with a task-dependent, deterministic kernel. This adapted feature kernel is the solution to a min-max optimization problem that depends on the inputs to the network, the target outputs, and the network architecture. We show that this limit outperforms lazy-learning limits and other limits obtained from networks in NTK scaling.
Link To Code: https://github.com/clarissalauditi/adaptive_kernel_predictors
Primary Area: Deep Learning->Theory
Keywords: Kernel methods; feature learning; Bayesian networks
Submission Number: 12400