- TL;DR: We show that the generalization behavior of wide neural networks depends strongly on their initialization.
- Abstract: The recently developed link between strongly overparametrized neural networks (NNs) and kernel methods has opened a new way to understand puzzling features of NNs, such as their convergence and generalization behaviors. In this paper, we make the bias of initialization on strongly overparametrized NNs under gradient descent explicit. We prove that fully-connected wide ReLU-NNs trained with squared loss are essentially a sum of two parts: The first is the minimum complexity solution of an interpolating kernel method, while the second contributes to the test error only and depends heavily on the initialization. This decomposition has two consequences: (a) the second part becomes negligible in the regime of small initialization variance, which allows us to transfer generalization bounds from minimum complexity interpolating kernel methods to NNs; (b) in the opposite regime, the test error of wide NNs increases significantly with the initialization variance, while still interpolating the training data perfectly. Our work shows that -- contrary to common belief -- the initialization scheme has a strong effect on generalization performance, providing a novel criterion to identify good initialization strategies.
- Keywords: overparametrization, generalization, initialization, gradient descent, kernel methods, deep learning theory