Sharp Generalization for Nonparametric Regression in Interpolation Space by Shallow Neural Networks with Channel Attention
Keywords: Nonparametric Regression, Over-Parameterized Neural Network, Channel Attention, Interpolation Space, Minimax Optimal Rate
TL;DR: We show that an over-parameterized neural network with channel attention, trained by gradient descent with early stopping, achieves risk bounds sharper than the current state-of-the-art.
Abstract: We study nonparametric regression by an over-parameterized two-layer neural network with channel attention, where the training features are drawn from an arbitrary continuous distribution on the unit sphere in $\mathbb{R}^d$ and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with early-stopped gradient descent achieves a sharp nonparametric regression risk bound of
$\mathcal{O}(\varepsilon_n^2)$, where $\varepsilon_n$ is the critical population rate of the kernel induced by the network with channel attention; this bound is sharper than the current state-of-the-art regression risk~\citep{Yang2025-generalization-two-layer-regression} for distribution-free spherical covariates. When the covariate distribution admits a widely studied eigenvalue decay rate with parameter $2\alpha$ for $\alpha > 1/2$ and the target function lies in an interpolation space associated with the widely studied spectral bias of deep learning, our risk bound becomes $\mathcal{O}(n^{-\frac{6\alpha}{6\alpha +1}})$. This rate sharpens the currently known nearly-optimal rate of $\mathcal{O}(n^{-\frac{6\alpha}{6\alpha+1}})\log^2(1/\delta)$~\citep{Li2024-edr-general-domain} by removing the $\log^2(1/\delta)$ factor, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small failure probability. Our analysis rests on two key technical contributions. First, we establish a principled decomposition of the network output at each GD step into a component lying in the reproducing kernel Hilbert space (RKHS) of a newly induced attention kernel and a residual term with small $L^{\infty}$-norm. Second, building on this decomposition, we employ local Rademacher complexity to obtain a sharp bound on the complexity of the class of neural network functions traversed along the GD steps. Our findings further indicate that channel attention enables neural networks to escape the linear NTK regime and generalize more sharply than vanilla networks without channel attention, since the kernel complexity of the channel-attention kernel is lower than that of the standard NTK induced by the vanilla network. Our work is among the first to reveal a provable benefit of channel attention for nonparametric regression, and our theory is supported by simulation results on both synthetic and real datasets.
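For intuition on where the exponent $\frac{6\alpha}{6\alpha+1}$ can come from, the display below sketches the standard critical-radius calculation for a kernel with polynomial eigenvalue decay $\mu_j \asymp j^{-p}$; identifying the attention kernel's decay exponent with $p = 6\alpha$ is an assumption made here purely for illustration, consistent with the stated rate but not spelled out in the abstract.

```latex
% Standard local Rademacher complexity fixed point for eigenvalues \mu_j \asymp j^{-p}, p > 1
% (here p = 6\alpha is an assumed decay exponent for the attention kernel):
\[
  \mathcal{R}_n(\varepsilon)
  \asymp \sqrt{\frac{1}{n}\sum_{j\ge 1}\min\{\mu_j,\varepsilon^2\}}
  \asymp \frac{\varepsilon^{1-1/p}}{\sqrt{n}},
  \qquad
  \mathcal{R}_n(\varepsilon_n) \asymp \varepsilon_n^2
  \;\Longrightarrow\;
  \varepsilon_n^2 \asymp n^{-\frac{p}{p+1}} = n^{-\frac{6\alpha}{6\alpha+1}}.
\]
```

To make the setup concrete, the following is a minimal, runnable sketch of a two-layer network with a channel-attention gate trained by early-stopped gradient descent on spherical covariates. It is not the paper's construction: the sigmoid (squeeze-and-excitation-style) gate, training only the first layer, the initialization scales, and the validation-based stopping rule are all assumptions chosen here for illustration.

```python
import numpy as np

# Sketch (assumptions, not the paper's exact architecture or stopping rule):
# f(x) = sum_r a_r * g_r(x) * relu(w_r^T x), with channel-attention gate
# g_r(x) = sigmoid(v_r^T x) reweighting the r-th hidden channel.
rng = np.random.default_rng(0)
n, d, m = 400, 5, 1024                                   # samples, input dim, width (over-parameterized)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # covariates on the unit sphere
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=n)     # smooth target plus label noise

n_tr = 300
Xtr, ytr, Xva, yva = X[:n_tr], y[:n_tr], X[n_tr:], y[n_tr:]

W = rng.normal(size=(m, d)) / np.sqrt(d)                 # first-layer weights (trained)
V = rng.normal(size=(m, d)) / np.sqrt(d)                 # attention weights (frozen in this sketch)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)         # second-layer signs (frozen in this sketch)

def forward(X, W):
    gate = 1.0 / (1.0 + np.exp(-(X @ V.T)))              # channel-attention gates in (0, 1)
    act = np.maximum(X @ W.T, 0.0)                       # ReLU activations
    return (gate * act) @ a

lr, max_steps, patience = 2.0, 2000, 50
best_val, best_step, stall = np.inf, 0, 0
for t in range(max_steps):
    resid = forward(Xtr, W) - ytr
    val_mse = np.mean((forward(Xva, W) - yva) ** 2)
    if val_mse < best_val - 1e-6:
        best_val, best_step, stall = val_mse, t, 0
    else:
        stall += 1
        if stall >= patience:                            # early stopping on held-out error
            break
    gate = 1.0 / (1.0 + np.exp(-(Xtr @ V.T)))
    mask = (Xtr @ W.T > 0.0).astype(float)
    # dL/dw_r = (1/n) sum_i resid_i * a_r * gate_{ir} * 1{w_r^T x_i > 0} * x_i
    grad_W = ((resid[:, None] * gate * mask) * a).T @ Xtr / n_tr
    W -= lr * grad_W

print(f"early-stopped near step {best_step}, best validation MSE = {best_val:.4f}")
```

In this sketch the stopping time plays the role of the theoretical early-stopping rule tied to the critical rate $\varepsilon_n$; a held-out validation set is used only because it is the simplest practical proxy.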
Primary Area: learning theory
Submission Number: 2690