On the Explicit Role of Initialization on the Convergence and Generalization Properties of Overparametrized Linear Networks

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Abstract: Neural networks trained via gradient descent with random initialization and without any regularization enjoy good generalization performance in practice despite being highly overparametrized. A promising direction to explain this phenomenon is the \emph{Neural Tangent Kernel} (NTK), which characterizes the implicit regularization effect of gradient flow/descent on infinitely wide neural networks with random initialization. However, a non-asymptotic analysis that connects generalization performance, initialization, and optimization for finite-width networks remains elusive. In this paper, we present a novel analysis of overparametrized single-hidden-layer linear networks, which formally connects initialization, optimization, and overparametrization with generalization performance. We exploit the fact that gradient flow preserves a certain matrix that characterizes the \emph{imbalance} of the network weights to show that the squared loss converges exponentially at a rate that depends on the level of imbalance of the initialization. Such guarantees on the convergence rate allow us to show that a large hidden-layer width, together with (properly scaled) random initialization, implicitly constrains the dynamics of the network parameters to be close to a low-dimensional manifold. In turn, minimizing the loss over this manifold leads to solutions with good generalization, which in the linear case correspond to the min-norm solution. Finally, we derive a novel $\mathcal{O}(h^{-1/2})$ upper bound on the operator norm distance between the trained network and the min-norm solution, where $h$ is the hidden-layer width.
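As a concrete illustration of the setting described in the abstract, the following minimal NumPy sketch (not the authors' code; the problem sizes, learning rate, and number of steps are all illustrative assumptions) trains a single-hidden-layer linear network with scalar output by gradient descent from properly scaled random initialization. It tracks the imbalance matrix $D = W_1 W_1^\top - w_2 w_2^\top$, which gradient flow conserves exactly and small gradient-descent steps approximately preserve, and it measures the distance between the trained end-to-end map and the min-norm solution.

```python
# Minimal illustrative sketch (not the paper's code) of the setting in the abstract:
# a single-hidden-layer linear network y_hat = w2^T W1 x with hidden width h, trained
# by gradient descent on the squared loss from scaled random initialization.
# It checks empirically that (i) the imbalance matrix D = W1 W1^T - w2 w2^T changes
# only slightly during training (gradient flow conserves it exactly), and (ii) the
# trained end-to-end map W1^T w2 is close to the min-norm solution pinv(X) y.
# All sizes and hyperparameters below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 20, 50, 500                       # n < d: overparametrized regression
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Properly scaled random initialization: entries of variance ~ 1/h.
W1 = rng.standard_normal((h, d)) / np.sqrt(h)
w2 = rng.standard_normal(h) / np.sqrt(h)

D0 = W1 @ W1.T - np.outer(w2, w2)           # imbalance matrix at initialization

lr, steps = 1e-3, 5000
for _ in range(steps):
    r = X @ (W1.T @ w2) - y                 # residuals of the end-to-end predictor
    g = X.T @ r                             # gradient w.r.t. the end-to-end map
    grad_W1 = np.outer(w2, g)               # dL/dW1 = w2 g^T
    grad_w2 = W1 @ g                        # dL/dw2 = W1 g
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2

D = W1 @ W1.T - np.outer(w2, w2)
theta = W1.T @ w2                           # trained end-to-end linear map
theta_min_norm = np.linalg.pinv(X) @ y      # min-norm interpolating solution

print("train loss            :", 0.5 * np.sum((X @ theta - y) ** 2))
print("relative imbalance drift:", np.linalg.norm(D - D0) / np.linalg.norm(D0))
print("distance to min-norm  :", np.linalg.norm(theta - theta_min_norm))
```

Under these assumptions, increasing the hidden width h in the sketch (e.g. from 500 to 5000) should shrink the reported distance to the min-norm solution roughly like $h^{-1/2}$, in line with the bound stated in the abstract; this is a numerical illustration, not a reproduction of the paper's experiments.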
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: This paper studies the convergence and generalization properties of single-hidden-layer linear networks
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=sB8TwRKBf4