Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate
Keywords: Nonparametric Regression, Low-Degree Spherical Polynomial, Neural Network, Gradient Descent, Feature Learning, Minimax Optimal Rate
TL;DR: We show that an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection method achieves nearly minimax optimal regression risk for learning low-degree spherical polynomials.
Abstract: We study the problem of learning a low-degree spherical polynomial of degree $k_0 = \Theta(1)$ on the unit sphere in $\mathbb{R}^d$ by training an over-parameterized two-layer neural network with augmented features. Our main result is an improved sample complexity: for any target regression risk $\varepsilon \in (0, \Theta(d^{-k_0})]$, a network trained by Gradient Descent with Projection (GDP) attains risk $\varepsilon$ with probability at least $1-\delta$, for any $\delta \in (0,1)$, from $n \asymp \log(4/\delta)\cdot d^{k_0}/\varepsilon$ samples, improving over the sample complexity $\Theta(d^{k_0}\max\{\varepsilon^{-2},\log d\})$. Equivalently, the regression risk is $\log(4/\delta)\cdot \Theta(d^{k_0}/n)$ with probability at least $1-\delta$, which is nearly optimal: it matches, up to the $\log(4/\delta)$ factor, the minimax rate $\Theta(d^{k_0}/n)$ for kernels of rank $\Theta(d^{k_0})$. To our knowledge, this is the first sharp risk bound with algorithmic guarantees for over-parameterized networks on such tasks. Our approach goes beyond the neural tangent kernel (NTK) limit by learning a subspace of the NTK eigenspace: a projection operator restricts the solution to a low-dimensional RKHS subspace, which enables the sharp bound.
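To get a feel for the two sample-complexity expressions quoted in the abstract, the following is a minimal arithmetic sketch. It assumes every hidden $\Theta(\cdot)$ constant equals 1, which the abstract does not state, so the numbers are only indicative of scaling and not of actual sample sizes. The function names and the example values of `d`, `k0`, `eps`, and `delta` are hypothetical and chosen for illustration.

```python
import math


def gdp_sample_size(d, k0, eps, delta):
    """n = log(4/delta) * d^k0 / eps, with the hidden constant set to 1 (assumption)."""
    return math.log(4.0 / delta) * d**k0 / eps


def prior_sample_size(d, k0, eps):
    """n = d^k0 * max(eps^-2, log d), with the hidden constant set to 1 (assumption)."""
    return d**k0 * max(eps ** (-2), math.log(d))


def gdp_risk(d, k0, n, delta):
    """Risk log(4/delta) * d^k0 / n implied by n samples, constant set to 1 (assumption)."""
    return math.log(4.0 / delta) * d**k0 / n


if __name__ == "__main__":
    # Illustrative values only; eps is chosen below d^(-k0) = 0.01,
    # in line with the stated range for the target risk.
    d, k0, eps, delta = 10, 2, 1e-3, 0.05
    n_gdp = gdp_sample_size(d, k0, eps, delta)
    n_prior = prior_sample_size(d, k0, eps)
    print(f"GDP bound:   n ~ {n_gdp:.3g}")
    print(f"prior bound: n ~ {n_prior:.3g}")
    print(f"ratio prior/GDP ~ {n_prior / n_gdp:.3g}")
```

In the small-risk regime the gap comes from the $\varepsilon^{-1}$ versus $\varepsilon^{-2}$ dependence, so the ratio between the two expressions grows roughly like $1/(\varepsilon\log(4/\delta))$ as the target risk shrinks.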
Submission Number: 177