Keywords: Deep Learning, Optimization, Deep Networks, Regularization
Abstract: Several recent works have aimed to explain why severely overparameterized models, generalize well when trained by Stochastic Gradient Descent (SGD). The emergent consensus explanation has two parts: the first is that there are “no bad local minima,” while the second is that SGD performs implicit regularization by having a bias towards low complexity models. We revisit both of these ideas in the context of image classification with common deep neural network architectures. Our first finding is that there exist bad global minima, i.e., models that fit the training set perfectly, yet have poor generalization. Our second finding is that given only unlabeled training data, we can easily construct initializations that will cause SGD to quickly converge to such bad global minima that exhibit a test accuracy degradation of up to 40% compared to training from a random initialization. Finally, we show that regularization seems to provide SGD with an escape route: once heuristics such as data augmentation are used, starting from a complex model (adversarial initialization) has no effect on the test accuracy.