Sharp Generalization for Shallow Neural Networks with Channel Attention

Yingzhen Yang

Published: 29 May 2026, Last Modified: 29 May 2026. Venue: HiLD at ICML 2026 (Poster). License: CC BY 4.0

Keywords: Nonparametric Regression, Over-Parameterized Neural Network, Channel Attention, Interpolation Space, Minimax Optimal Rate

TL;DR: We show that an over-parameterized neural network with channel attention, trained by gradient descent with early stopping, achieves sharper risk bounds than the current state-of-the-art.

Abstract: We study nonparametric regression using an over-parameterized two-layer neural network with channel attention, where training features are drawn from an arbitrary continuous distribution on the unit sphere in $\mathbb{R}^d$, and the target function lies in a standard interpolation space. We show that early-stopped gradient descent achieves a sharp regression risk of $\mathcal{O}(\varepsilon_n^2)$, where $\varepsilon_n$ is the critical population rate of the induced attention kernel, improving upon the state-of-the-art~\citep{Yang2025-generalization-two-layer-regression} for distribution-free spherical covariates. When the covariate distribution satisfies an eigenvalue decay condition with parameter $2\alpha$ and $\alpha > 1/2$, the rate becomes $\mathcal{O}(n^{-\frac{6\alpha}{6\alpha+1}})$ under spectral bias assumptions, improving over the nearly-optimal rate $\mathcal{O}(n^{-\frac{6\alpha}{6\alpha+1}})\log^2(1/\delta)$~\citep{Li2024-edr-general-domain}, where $n$ is the sample size and $\delta \in (0,1)$. This is, to our knowledge, the first work establishing a theoretical advantage of channel attention for nonparametric regression. Our analysis shows that channel attention aligns with spectrally biased targets and induces a novel attention kernel. We decompose the network output at each gradient descent step into an RKHS component of this kernel and a small $L^{\infty}$ residual, and combine this decomposition with local Rademacher complexity to obtain sharp bounds. Our results further show that channel attention changes the training dynamics relative to the vanilla network without attention and enables escape from that network's linear NTK regime, yielding better generalization than vanilla networks with lower kernel complexity, as supported by simulations on synthetic and real data.

Submission Number: 87
