Keywords: depth separation, mean-field, nonconvex optimization
TL;DR: We show that 3-layer networks trained with gradient flow can efficiently learn a function that no 2-layer network can efficiently approximate.
Abstract: Depth separation, the question of why a deeper network is more powerful than a shallow one, has been a major problem in deep learning theory. Previous results often focus on representation power. For example, Safran et al. (2019) constructed a function that is easy to approximate using a 3-layer network but not approximable by any 2-layer network. In this paper, we show that this separation is in fact algorithmic: one can efficiently learn the function constructed by Safran et al. (2019) using an overparametrized network with polynomially many neurons. Our result relies on a new way of extending the mean-field limit to multilayer networks, and on a decomposition of the loss that factors out the error introduced by the discretization of infinite-width mean-field networks.
Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)