- TL;DR: In the early phase of training of deep neural networks there exists a "break-even point" which determines properties of the entire optimization trajectory.
- Abstract: Understanding the optimization trajectory is critical to understand training of deep neural networks. We show how the hyperparameters of stochastic gradient descent influence the covariance of the gradients (K) and the Hessian of the training loss (H) along this trajectory. Based on a theoretical model, we predict that using a high learning rate or a small batch size in the early phase of training leads SGD to regions of the parameter space with (1) reduced spectral norm of K, and (2) improved conditioning of K and H. We show that the point on the trajectory after which these effects hold, which we refer to as the break-even point, is reached early during training. We demonstrate these effects empirically for a range of deep neural networks applied to multiple different tasks. Finally, we apply our analysis to networks with batch normalization (BN) layers and find that it is necessary to use a high learning rate to achieve loss smoothing effects attributed previously to BN alone.
- Keywords: generalization, sgd, learning rate, batch size, hessian, curvature, trajectory, optimization