Abstract: The loss surface of deep neural networks has recently attracted interest
in the optimization and machine learning communities as a prime example of
a high-dimensional non-convex problem. Some insights were recently gained using spin-glass
models and mean-field approximations, but at the expense of strongly simplifying the nonlinear nature of the model.
In this work, we do not make any such approximation and study conditions
on the data distribution and model architecture that prevent the existence
of bad local minima. Our theoretical work quantifies and formalizes two
important folklore facts: (i) the landscape of deep linear networks has a radically different topology
from that of deep half-rectified ones, and (ii) the energy landscape
in the non-linear case is fundamentally controlled by the interplay between the smoothness of the data distribution and model over-parametrization. Our main theoretical contribution is to prove that the level sets of half-rectified single-layer networks are asymptotically connected, and we provide explicit bounds that reveal the aforementioned interplay.
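For context, the folklore link between connected level sets and the absence of bad local minima invoked above can be stated as follows; the notation ($F$, $\theta$, $\Omega_F$) is illustrative rather than quoted from the paper:

```latex
% Illustrative statement of the standard fact behind the abstract's claim
% (notation is ours, not quoted from the paper): connected sublevel sets
% rule out strictly suboptimal strict local minima.
\[
  \Omega_F(\lambda) \;=\; \{\theta : F(\theta) \le \lambda\}
\]
\[
  \Omega_F(\lambda)\ \text{connected for all}\ \lambda
  \;\Longrightarrow\;
  \text{every}\ \theta\ \text{is joined to a global minimizer by a path on which}\ F \le F(\theta).
\]
```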
The conditioning of gradient descent is the next challenge we address.
We study this question through the geometry of the level sets, and we introduce
an algorithm to efficiently estimate the regularity of such sets on large-scale networks.
Our empirical results show that these level sets remain connected throughout
the learning phase, suggesting near-convex behavior, but that they become
exponentially more curved as the energy level decays, in accordance with the
very-low-curvature attractors observed in practice.
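The abstract does not spell out the path-finding algorithm itself; as a rough illustration of the kind of level-set connectedness check it refers to, the sketch below trains two small half-rectified networks from different initializations and tests whether they can be joined by a low-loss path, using recursive bisection with a few gradient steps to "repair" high-loss midpoints. The toy data, network sizes, and every function name here are assumptions for illustration, not the authors' exact procedure.

```python
# Minimal sketch (not the authors' exact procedure) of a level-set
# connectedness check for a one-hidden-layer half-rectified (ReLU) network.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data generated by a single ReLU unit.
d, h, n = 5, 32, 256                        # input dim, hidden width, samples
X = rng.normal(size=(n, d))
y = np.maximum(X @ rng.normal(size=d), 0.0)

def unpack(theta):
    """Split a flat parameter vector into (W1, w2)."""
    return theta[:d * h].reshape(d, h), theta[d * h:]

def loss_and_grad(theta):
    """Mean-squared error of relu(X W1) w2 against y, with its gradient."""
    W1, w2 = unpack(theta)
    Z = X @ W1                              # pre-activations, shape (n, h)
    H = np.maximum(Z, 0.0)                  # half-rectified hidden layer
    r = H @ w2 - y                          # residuals
    L = np.mean(r ** 2)
    dpred = 2.0 * r / n                     # dL / d(prediction)
    dW1 = X.T @ (np.outer(dpred, w2) * (Z > 0))
    dw2 = H.T @ dpred
    return L, np.concatenate([dW1.ravel(), dw2])

def descend(theta, steps=500, lr=0.02):
    """Plain gradient descent from theta."""
    for _ in range(steps):
        _, g = loss_and_grad(theta)
        theta = theta - lr * g
    return theta

def connected_below(theta_a, theta_b, level, depth=4, repair_steps=150):
    """True if recursive bisection (with local gradient repair of midpoints)
    finds a path from theta_a to theta_b that stays below `level`."""
    if depth == 0:
        return True
    mid = 0.5 * (theta_a + theta_b)
    if loss_and_grad(mid)[0] > level:
        mid = descend(mid, steps=repair_steps)   # pull the midpoint back down
        if loss_and_grad(mid)[0] > level:
            return False
    return (connected_below(theta_a, mid, level, depth - 1, repair_steps) and
            connected_below(mid, theta_b, level, depth - 1, repair_steps))

# Two independently trained solutions and a connectivity check at a level
# slightly above the worse endpoint loss.
theta1 = descend(rng.normal(scale=0.5, size=d * h + h))
theta2 = descend(rng.normal(scale=0.5, size=d * h + h))
level = 1.1 * max(loss_and_grad(theta1)[0], loss_and_grad(theta2)[0])
print("endpoint losses:", loss_and_grad(theta1)[0], loss_and_grad(theta2)[0])
print("connected below", round(level, 4), ":",
      connected_below(theta1, theta2, level))
```

Repeating this check at decreasing loss levels gives a crude empirical picture of when the level sets stop being connected, which is the kind of regularity estimate the abstract refers to.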
TL;DR: We provide theoretical, algorithmic, and experimental results concerning the optimization landscape of deep neural networks
Conflicts: berkeley.edu, nyu.edu, fb.com
Keywords: Theory, Deep learning