Supplementary Material: zip
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Deep Learning Theory, Learning Theory, Gradient Descent, Analysis of Boolean functions
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Many works have shown learnability of functions on the Boolean hypercube via gradient descent. These analyses establish guarantees by exploiting convexity of the problem, even though most loss functions are highly non-convex. In addition, the analyses explicitly show that the hypothesis class can approximate the target function; this is known as a representation theorem. In this work we give gradient descent guarantees for learning functions on the Boolean hypercube, under both the mean squared and hinge losses, using $2$-layer neural networks with a single hidden non-linear layer. Furthermore, all of our analyses apply to the ReLU activation function. Moreover, for both losses, we make no use of convexity of the problem and do not explicitly prove a representation theorem; rather, a representation theorem follows as a consequence of our analysis. In the hinge loss setting, to learn size-$k$ parities in dimension $n$ to error $\epsilon$, we obtain bounds of $n^{O(k)}\mathrm{poly}(\frac{1}{\epsilon})$ on the network width and number of samples, and $n^{O(k)}\log(\frac{1}{\epsilon})$ on the number of iterations. This upper bound matches the SQ lower bound of $n^{\Omega(k)}$. In the mean squared loss setting, given that the Fourier spectrum of the activation function has non-zero Fourier coefficients up to degree $k$, and that the best degree-$k$ polynomial approximation of the target function achieves mean squared error $\epsilon_0$, we give guarantees of $n^{O(k)}\mathrm{poly}(\frac{1}{\epsilon})$ on the network width and number of samples, and $n^{O(k)}\log(\frac{1}{\epsilon})$ on the number of iterations, for an error of $\epsilon + \epsilon_0$. To the best of our knowledge, our bound of $n^{O(k)}\log(\frac{1}{\epsilon})$ iterations for learning degree-$k$ polynomials, on both losses, improves on all previous bounds in the Boolean setting, which is a consequence of not using any convexity of the problem in our analysis. Specifically, other works in the Boolean setting bound the number of iterations by $n^{O(k)}\mathrm{poly}(\frac{1}{\epsilon})$. Moreover, as a corollary to our agnostic learning guarantee, we establish that lower-degree Fourier components are learned before higher-degree ones, a phenomenon observed experimentally. Finally, as a corollary to our mean squared loss guarantee, we show that neural networks with sparse hidden ReLU units as target functions can be efficiently learned with gradient descent.
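The following is a minimal illustrative sketch, not the paper's algorithm or parameter choices, of the hinge-loss setting described in the abstract: a 2-layer network with a single hidden ReLU layer, trained with full-batch gradient descent to fit a size-$k$ parity on the Boolean hypercube. The dimension, width, sample size, step size, and iteration count below are placeholder values and do not reflect the $n^{O(k)}$ quantities from the submission.

```python
# Sketch (assumed setup, not the authors' exact method): f(x) = sum_j a_j * relu(w_j . x + b_j)
# trained with full-batch gradient descent on the hinge loss, target = size-k parity.
import numpy as np

rng = np.random.default_rng(0)

n, k = 10, 3          # input dimension, parity size (placeholders)
width = 512           # hidden width (placeholder)
m = 2000              # number of samples (placeholder)
lr, T = 0.05, 500     # step size and iterations (placeholders)

# Uniform Boolean inputs in {-1, 1}^n; labels given by the parity over the first k coordinates.
X = rng.choice([-1.0, 1.0], size=(m, n))
y = np.prod(X[:, :k], axis=1)

# Single hidden non-linear (ReLU) layer.
W = rng.normal(scale=1.0 / np.sqrt(n), size=(width, n))
b = rng.normal(scale=1.0, size=width)
a = rng.normal(scale=1.0 / width, size=width)

for t in range(T):
    pre = X @ W.T + b                     # (m, width) pre-activations
    h = np.maximum(pre, 0.0)              # ReLU
    f = h @ a                             # network outputs
    active = (y * f < 1.0).astype(float)  # hinge loss max(0, 1 - y f) is active where y f < 1
    g_f = -(active * y) / m               # subgradient of mean hinge loss w.r.t. f, per sample
    grad_a = h.T @ g_f
    g_h = np.outer(g_f, a) * (pre > 0)    # backprop through the ReLU
    grad_W = g_h.T @ X
    grad_b = g_h.sum(axis=0)
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b

f_final = np.maximum(X @ W.T + b, 0.0) @ a
print("final mean hinge loss:", np.mean(np.maximum(0.0, 1.0 - y * f_final)))
print("train 0-1 error:", np.mean(np.sign(f_final) != y))
```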
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8722