TL;DR: We present a theory to study *how* deep neural networks (with even super-constantly many layers) can perform hierarchical feature learning on tasks that are not known to be efficiently solvable by non-hierarchical methods (such as kernel methods).
Abstract: (this is a theory paper)
Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complex target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD) or its variants.
On the conceptual side, we present a characterization of how certain deep (i.e., super-constantly many layers) neural networks can still be trained sample- and time-efficiently on hierarchical learning tasks, when no known existing algorithm (including layer-wise training, kernel methods, etc.) is efficient. We establish a new principle called ``backward feature correction'', where \emph{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}. We believe this is key to how deep learning performs deep (hierarchical) learning, as opposed to layer-wise learning or simulating some known non-hierarchical method.
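To make the contrast concrete, here is a minimal PyTorch sketch (not from the paper) comparing greedy layer-wise training, where lower layers are frozen after being fit once, against joint end-to-end SGD, where gradients from the higher layers keep adjusting the lower-level features. The toy hierarchical target, network architecture, and all hyperparameters below are illustrative assumptions, not the setup analyzed in the paper.

```python
# Minimal sketch (illustrative assumptions only): layer-wise training vs.
# joint end-to-end SGD on a toy hierarchical target y = g(h(x)).
import torch
import torch.nn as nn

torch.manual_seed(0)

d = 20
X = torch.randn(4096, d)
h_true = torch.tanh(X @ torch.randn(d, d) / d ** 0.5)         # "lower-level" features
y = (h_true @ torch.randn(d, 1) / d ** 0.5).squeeze(-1) ** 2   # "higher-level" composition


def make_net():
    return nn.Sequential(nn.Linear(d, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))


def train(net, params, steps=2000, lr=0.05):
    # Plain SGD on the square loss, updating only the given parameters.
    opt = torch.optim.SGD(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(net(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return loss.item()


# (a) Layer-wise: fit the first layer, then freeze it and train only the
#     upper layers; errors made in the lower-level features are never revisited.
layerwise = make_net()
train(layerwise, list(layerwise[0].parameters()))
loss_lw = train(layerwise, list(layerwise[2:].parameters()))

# (b) Joint SGD: all layers are trained together, so gradients from the upper
#     layers can keep correcting the lower-level features ("backward feature
#     correction" in spirit, not the paper's formal construction).
joint = make_net()
loss_joint = train(joint, list(joint.parameters()))

print(f"layer-wise final loss: {loss_lw:.4f} | joint SGD final loss: {loss_joint:.4f}")
```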
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)
Supplementary Material: zip