Abstract: We use empirical methods to argue that deep neural networks (DNNs) do not achieve their performance by \textit{memorizing} training data, despite overly expressive model architectures.
Instead, they learn a simple, available hypothesis that fits the finite data samples.
In support of this view, we establish qualitative differences between learning from noise and from natural datasets, showing that (1) more capacity is needed to fit noise, (2) time to convergence is longer for random labels, but \emph{shorter} for random inputs, and (3) DNNs trained on real data learn simpler functions than DNNs trained on noise data, as measured by the sharpness of the loss function at convergence.
Finally, we demonstrate that appropriately tuned explicit regularization, e.g.~dropout, can degrade DNN training performance on noise datasets without compromising generalization on real data.
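As an illustration of the convergence-time observation in point (2), here is a minimal sketch of a random-label fitting experiment. It assumes PyTorch and torchvision; the dataset (MNIST subset), MLP size, optimizer settings, and label-corruption scheme are illustrative choices, not the paper's exact experimental setup.

```python
# Minimal sketch: compare how many epochs a small MLP needs to fully fit
# real labels vs. uniformly random labels. All hyperparameters are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets


def make_loader(randomize_labels: bool, n: int = 10000, batch_size: int = 128) -> DataLoader:
    """Return a loader over an MNIST subset, optionally with labels replaced by noise."""
    ds = datasets.MNIST(".", train=True, download=True)
    xs = ds.data[:n].float().div(255.0).view(n, -1)      # flatten 28x28 images
    ys = torch.randint(0, 10, (n,)) if randomize_labels else ds.targets[:n].clone()
    return DataLoader(TensorDataset(xs, ys), batch_size=batch_size, shuffle=True)


def epochs_to_fit(loader: DataLoader, max_epochs: int = 100):
    """Train a small MLP and return the epoch at which training accuracy reaches 100%."""
    model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        correct = total = 0
        for x, y in loader:
            opt.zero_grad()
            logits = model(x)
            loss_fn(logits, y).backward()
            opt.step()
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
        if correct == total:              # training set fully fit
            return epoch + 1
    return None                           # did not fit within the budget


# Expectation, per the abstract: fitting random labels takes noticeably longer.
print("epochs to fit real labels  :", epochs_to_fit(make_loader(randomize_labels=False)))
print("epochs to fit random labels:", epochs_to_fit(make_loader(randomize_labels=True)))
```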
TL;DR: Deep Nets Don't Learn via Memorization
Keywords: Deep learning, Optimization
Conflicts: umontreal.ca, polymtl.ca, uj.edu.pl, iai.uni-bonn.de