Sharp Generalization for Nonparametric Regression by Over-Parameterized Neural Networks: A Distribution-Free Analysis in Spherical Covariate

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: We show that an over-parameterized two-layer neural network trained by gradient descent (GD) with early stopping achieves minimax optimal convergence rates for nonparametric regression, and our results are distribution-free as long as the covariates lie on the unit sphere.
Abstract: Sharp generalization bounds for neural networks trained by gradient descent (GD) are of central interest in statistical learning theory and deep learning. In this paper, we consider nonparametric regression by an over-parameterized two-layer neural network trained by GD. We show that, if the network is trained by GD with early stopping, then the trained network achieves a sharp nonparametric regression risk rate of $O(\epsilon_n^2)$, which is the same rate as that of classical kernel regression trained by GD with early stopping, where $\epsilon_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. We remark that our result does not require distributional assumptions on the covariate as long as the covariate lies on the unit sphere, in strong contrast with many existing results which rely on specific distributions such as the spherical uniform distribution or distributions satisfying certain restrictive conditions. As a special case of our general result, when the eigenvalues of the associated NTK decay at a rate of $\lambda_j \asymp j^{-\frac{d}{d-1}}$ for $j \ge 1$, which holds under certain distributional assumptions such as the training features following the spherical uniform distribution, we immediately obtain the minimax optimal rate of $O(n^{-\frac{d}{2d-1}})$, which is the main result of several existing works in this direction. The network width in our general result is lower bounded by a function of only $d$ and $\epsilon_n$; in particular, the width does not depend on the minimum eigenvalue of the empirical NTK matrix, whose lower bound usually requires additional assumptions on the training data. Our results are built upon two significant technical results which are of independent interest. First, uniform convergence to the NTK is established during training by GD, so that the neural network function at any step of GD admits a decomposition into a function in the Reproducing Kernel Hilbert Space associated with the NTK and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all possible neural network functions obtained by GD. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression, without distributional assumptions on the spherical covariate.
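As a quick sanity check of the special case quoted above, the following back-of-the-envelope calculation (assuming the standard kernel-complexity fixed-point definition of the critical radius $\epsilon_n$, which the paper may instantiate with different constants) recovers the stated rate from the eigenvalue decay:

```latex
% Sketch: from NTK eigenvalue decay to the minimax rate, via the usual critical-radius fixed point
%   R_n(\epsilon) := \sqrt{\tfrac{1}{n}\sum_{j\ge 1}\min\{\lambda_j,\epsilon^2\}} \asymp \epsilon^2 .
% With \lambda_j \asymp j^{-d/(d-1)}, the truncation level is j^* \asymp \epsilon^{-2(d-1)/d}, so
\[
\sum_{j\ge 1}\min\{\lambda_j,\epsilon^2\}
  \;\asymp\; j^*\,\epsilon^2 \;+\; \sum_{j> j^*} j^{-\frac{d}{d-1}}
  \;\asymp\; \epsilon^{2/d},
\qquad\text{hence}\qquad
R_n(\epsilon) \;\asymp\; \frac{\epsilon^{1/d}}{\sqrt{n}} .
\]
\[
\frac{\epsilon^{1/d}}{\sqrt{n}} \;\asymp\; \epsilon^{2}
\;\;\Longrightarrow\;\;
\epsilon_n \;\asymp\; n^{-\frac{d}{2(2d-1)}},
\qquad
\epsilon_n^{2} \;\asymp\; n^{-\frac{d}{2d-1}} .
\]
```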
Lay Summary: Understanding how well neural networks trained by gradient descent (GD) can generalize to new data is a core challenge in modern machine learning. In this work, we study a specific problem: predicting smooth relationships between inputs and outputs (nonparametric regression) using a two-layer neural network trained by GD. We show that, with early stopping, such a network can match the best-known performance of classical kernel methods — a class of powerful, well-understood algorithms — without relying on strong assumptions about the data distribution. Our results show that even with minimal structural assumptions (only requiring the input data to lie on a sphere), these neural networks achieve the same optimal prediction accuracy as if the data had followed more idealized, structured distributions. This makes our findings more widely applicable. To prove our results, we developed two new techniques: one that tracks how the network evolves during training, and another that carefully measures the complexity of all functions the network could learn. These tools may be useful in understanding broader classes of learning algorithms.
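To make the training procedure in the summary concrete, here is a minimal, hypothetical sketch (not the paper's setup): a two-layer ReLU network with inputs normalized to the unit sphere, trained by full-batch gradient descent and stopped early once a held-out validation loss plateaus. The width, learning rate, and stopping rule below are illustrative choices only.

```python
# Illustrative sketch: over-parameterized two-layer ReLU network, full-batch GD, early stopping.
# All hyperparameters here are hypothetical and chosen only for demonstration.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: covariates on the unit sphere S^{d-1}, noisy smooth regression target.
n, d, m = 200, 5, 2048                                # sample size, input dimension, network width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)         # project inputs onto the unit sphere
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)    # noisy target

# Two-layer network f(x) = (1/sqrt(m)) a^T relu(W x); the second layer is fixed at +/-1,
# a common simplification in NTK-style analyses.
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(W, X):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

# Hold out part of the data as one simple surrogate for an early-stopping rule.
tr, va = np.arange(0, 160), np.arange(160, n)
lr, best_va, best_W, patience = 0.2, np.inf, W.copy(), 0

for t in range(2000):
    resid = predict(W, X[tr]) - y[tr]
    # Gradient of the mean squared loss w.r.t. the first layer
    # (constant factor absorbed into the learning rate).
    act = (X[tr] @ W.T > 0).astype(float)              # ReLU activation pattern, n_tr x m
    grad = (act * resid[:, None]).T @ X[tr] * (a[:, None] / np.sqrt(m)) / len(tr)
    W -= lr * grad

    va_loss = np.mean((predict(W, X[va]) - y[va]) ** 2)
    if va_loss < best_va - 1e-6:
        best_va, best_W, patience = va_loss, W.copy(), 0
    else:
        patience += 1
        if patience >= 50:                             # stop once validation loss plateaus
            break

print(f"early-stopped at step {t}; best validation MSE {best_va:.4f}")
```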
Primary Area: Deep Learning->Theory
Keywords: Nonparametric Regression, Over-Parameterized Neural Network, Gradient Descent, Minimax Optimal Rate
Submission Number: 572