SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz
Feb 15, 2018 (modified: Feb 23, 2018) · ICLR 2018 Conference Blind Submission
Abstract: Neural networks exhibit good generalization behavior in the
over-parameterized regime, where the number of network parameters
exceeds the number of observations. Nonetheless,
current generalization bounds for neural networks fail to explain this
phenomenon. In an attempt to bridge this gap, we study the problem of
learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky
ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks.
Specifically, we prove convergence rates of SGD to a global
minimum and provide generalization guarantees for this global minimum
that are independent of the network size.
Our results therefore show that SGD both finds a global minimum during optimization and avoids overfitting, despite the high capacity of the model. To our knowledge, this is the first theoretical demonstration that SGD can avoid overfitting when learning over-specified neural network classifiers.
TL;DR: We show that SGD learns two-layer over-parameterized neural networks with Leaky ReLU activations that provably generalize on linearly separable data.
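To make the setting concrete, below is a minimal sketch (not the authors' code) of the kind of experiment the abstract describes: a two-layer network with Leaky ReLU hidden units and an over-parameterized hidden layer, trained with SGD on synthetic linearly separable data. The hinge-style loss, the fixed ±1 output layer, and all dimensions and hyperparameters here are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: SGD on a two-layer Leaky ReLU network, linearly separable data.
# Assumptions for illustration: hinge loss, fixed +/-1 output weights.
import numpy as np

rng = np.random.default_rng(0)

d, n, k = 20, 500, 100                     # input dim, samples, hidden width
w_star = rng.normal(size=d)                # ground-truth separator
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                    # linearly separable labels

alpha = 0.1                                # Leaky ReLU slope
W = rng.normal(scale=0.1, size=(k, d))     # trained hidden-layer weights
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])  # fixed output layer

def forward(x):
    z = W @ x
    h = np.where(z > 0, z, alpha * z)      # Leaky ReLU activation
    return v @ h, z                        # scalar output, pre-activations

lr = 0.01
for epoch in range(50):
    for i in rng.permutation(n):
        out, z = forward(X[i])
        if y[i] * out < 1:                 # hinge loss is active
            g = np.where(z > 0, 1.0, alpha)          # Leaky ReLU derivative
            W += lr * y[i] * (v * g)[:, None] * X[i][None, :]  # SGD step

preds = np.array([np.sign(forward(x)[0]) for x in X])
print("train accuracy:", (preds == y).mean())
```

With k * d parameters far exceeding the n observations, the sketch is over-parameterized in the abstract's sense; the paper's claim is that SGD in such a setting still reaches a global minimum and generalizes with bounds independent of the network size.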