Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Published: 18 Dec 2025, Last Modified: 02 Mar 2026 · ALT 2026 · CC BY 4.0
Keywords: Nonparametric Regression, Low-Degree Spherical Polynomial, Neural Network, Learnable Channel Attention, Feature Learning, Gradient Descent, Minimax Optimal Rate
TL;DR: We show that a two-layer neural network with learnable channel attention and finite width, trained by gradient descent, achieves the lowest possible sample complexity for learning a constant-degree spherical polynomial to any risk $\epsilon \in (0,1)$.
Abstract: We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ on the unit sphere in ${\mathbb R}^d$ by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\epsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by vanilla gradient descent (GD) requires a sample complexity of only $n = \Theta(d^{\ell_0}/\epsilon)$ with high probability, in contrast with the representative sample complexity $\Theta\big(d^{\ell_0} \max\{\epsilon^{-2},\log d\}\big)$, where $n$ is the training data size. Moreover, this sample complexity is not improvable, since the trained network attains a sharp nonparametric regression risk of order $\Theta(d^{\ell_0}/n)$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/n)$, so the regression risk of the network trained by GD is minimax optimal. The training of the two-layer NN with channel attention is a two-stage process. In stage one, a novel and provable learnable channel selection algorithm, acting as a learnable harmonic-degree selection process, selects the ground-truth channel number of the target function, $\ell_0$, among the initial $L \ge \ell_0$ channels of the activation function in the first layer with high probability. This learnable channel selection is performed by an efficient one-step GD update on both layers of the NN, which achieves feature learning for low-degree polynomials. In stage two, the second layer of the network is trained by standard GD using the activation function with the selected channels.
To the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.
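The two-stage procedure described in the abstract can be illustrated with a deliberately simplified toy sketch. Everything below is an assumption for illustration, not the paper's construction: monomial channels $(w_j^\top x)^{l}$ stand in for the harmonic-degree channels, the target is a simple degree-2 polynomial, the first-layer weights are frozen, and stage one takes a single GD step on the attention weights and second layer before selecting the channel with the largest attention magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, width, n = 8, 4, 32, 512

# Synthetic data on the unit sphere; toy target y = (v.x)^2, a degree-2 polynomial.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
y = (X @ v) ** 2

# Two-layer NN with channel attention: f(x) = sum_j a_j * sum_l c_l * (w_j.x)^(l+1)
W = rng.standard_normal((width, d)) / np.sqrt(d)  # first layer (frozen in this sketch)
a = rng.standard_normal(width) / np.sqrt(width)   # second layer
c = np.ones(L) / L                                # channel attention weights

def features(X, W):
    """Channel activations, shape (n, width, L); channel l holds (w_j.x)^(l+1)."""
    pre = X @ W.T
    return np.stack([pre ** (l + 1) for l in range(L)], axis=2)

def predict(Phi, a, c):
    return np.einsum('nwl,w,l->n', Phi, a, c)

# ---- Stage 1: one GD step on attention and second layer, then channel selection.
lr1 = 0.5
Phi = features(X, W)
resid = predict(Phi, a, c) - y
grad_c = np.einsum('n,nwl,w->l', resid, Phi, a) / n
grad_a = np.einsum('n,nwl,l->w', resid, Phi, c) / n
c -= lr1 * grad_c
a -= lr1 * grad_a
selected = int(np.argmax(np.abs(c)))  # keep the most responsive channel

# ---- Stage 2: train the second layer by GD on the selected channel only.
Phi_sel = Phi[:, :, selected]         # (n, width)
mse0 = np.mean((Phi_sel @ a - y) ** 2)
for _ in range(500):
    a -= 0.1 * Phi_sel.T @ (Phi_sel @ a - y) / n

mse = np.mean((Phi_sel @ a - y) ** 2)
```

The sketch keeps only the structure of the two-stage recipe: a single GD step is enough to break the symmetry among the attention weights, after which stage two reduces to GD on a linear model in the selected channel's features.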
Submission Number: 97