- Keywords: label corruption, over-parameterization, robustness, neural networks
- TL;DR: We prove that gradient descent is robust to label corruption despite over-parameterization under a rich dataset model.
- Abstract: This paper is focused on investigating and demystifying an intriguing robustness phenomena in over-parameterized neural network training. In particular we provide empirical and theoretical evidence that first order methods such as gradient descent are provably robust to noise/corruption on a constant fraction of the labels despite over-parameterization under a rich dataset model. In particular: i) First, we show that in the first few iterations where the updates are still in the vicinity of the initialization these algorithms only fit to the correct labels essentially ignoring the noisy labels. ii) Secondly, we prove that to start to overfit to the noisy labels these algorithms must stray rather far from from the initial model which can only occur after many more iterations. Together, these show that gradient descent with early stopping is provably robust to label noise and shed light on empirical robustness of deep networks as well as commonly adopted early-stopping heuristics.