Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks

Published: 31 Oct 2022, Last Modified: 03 Jul 2024
NeurIPS 2022 Accept
Readers: Everyone
Keywords: non-convex optimization, training dynamics, neural network
TL;DR: We study the convergence of GD and SGD when training mildly parameterized neural networks starting from random initialization.
Abstract: We study the convergence of GD and SGD when training mildly parameterized neural networks starting from random initialization. For a broad range of models and loss functions, including the widely used square loss and cross entropy loss, we prove an "early stage convergence" result: the loss decreases by a significant amount in the early stage of training, and this decrease is fast. Furthermore, for exponential-type loss functions, and under some assumptions on the training data, we show global convergence of GD. Instead of relying on extreme over-parameterization, our study is based on a microscopic analysis of the activation patterns of the neurons, which helps us derive gradient lower bounds. The results on activation patterns, which we call the "neuron partition", help build intuition for understanding the behavior of neural networks' training dynamics and may be of independent interest.
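A minimal sketch of the setting the abstract describes: a mildly parameterized two-layer ReLU network trained from random initialization with full-batch GD on the square loss, where one can observe the fast early-stage drop in the loss. The width, data, and learning rate below are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch only; not the paper's exact experimental setup.
import torch

torch.manual_seed(0)
n, d, m = 200, 10, 50                      # n samples, input dim d, width m (mild, not extreme over-parameterization)
X = torch.randn(n, d)
y = torch.randn(n, 1)

# Two-layer ReLU network with random (default) initialization.
model = torch.nn.Sequential(
    torch.nn.Linear(d, m),
    torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.05)  # full-batch updates, i.e. plain GD

losses = []
for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"initial loss: {losses[0]:.4f}, after 50 steps: {losses[50]:.4f}, final: {losses[-1]:.4f}")
```

Running this typically shows most of the loss reduction occurring in the first few dozen steps, which is the kind of early-stage behavior the paper analyzes theoretically.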
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/early-stage-convergence-and-global/code)