The Global Convergence Time of Stochastic Gradient Descent in Non-Convex Landscapes: Sharp Estimates via Large Deviations
TL;DR: We characterize the time to global convergence of stochastic gradient descent on non-convex objectives.
Abstract: In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, non-convex loss function. We approach this question through the lens of large deviations theory and randomly perturbed dynamical systems, and we provide a tight characterization of the associated hitting times of SGD with matching upper and lower bounds. Our analysis reveals that the global convergence time of SGD is dominated by the most "costly" set of obstacles that the algorithm may need to overcome in order to reach a global minimizer, coupling in this way the geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we provide a series of refinements and extensions of our analysis to, among others, loss functions with no spurious local minima or ones with bounded depths.
Lay Summary: The stochastic gradient algorithm (SGD) is widely used in the training of neural networks. However, despite its practical success, its behavior is still elusive due to the non-convexity of the objective.
We tackle the question of how fast SGD attains a global optimum by relying on tools from the theory of large deviations. Our analysis builds on a careful estimation of transition times between critical points and yields matching upper and lower bounds.
This work thus enables a better understanding of how SGD behaves for deep neural networks and opens the way towards practical improvements.
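As a toy illustration of the hitting times studied here (not the paper's method), the sketch below runs SGD on a hypothetical 1-D double-well loss with Gaussian gradient noise and records how many steps it takes to first reach a neighborhood of the global minimizer. The loss, step size, and noise levels are all assumptions chosen so the escape happens quickly; shrinking the noise makes the escape from the spurious well exponentially slower, which is the qualitative behavior that large-deviations estimates quantify.

```python
import random

def grad(x):
    # Gradient of the illustrative 1-D double-well loss
    # f(x) = (x^2 - 1)^2 + 0.3*x, which has a spurious local minimum
    # near x = 0.96 and the global minimum near x = -1.04
    # (the linear tilt breaks the symmetry between the two wells).
    return 4 * x * (x**2 - 1) + 0.3

def hitting_time(eta=0.1, sigma=2.0, x0=1.0, target=-1.0, tol=0.2,
                 max_steps=10**6, seed=0):
    """Number of noisy-gradient steps until the iterate first enters
    a tol-neighborhood of the global minimizer (None if it never does)."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, max_steps + 1):
        noise = rng.gauss(0.0, 1.0)           # stochastic gradient noise
        x -= eta * (grad(x) + sigma * noise)  # one SGD step
        if abs(x - target) < tol:
            return t
    return None

# Starting in the spurious well at x0 = 1.0, measure the hitting time
# of the global well for two noise levels.
times = [hitting_time(sigma=s, seed=1) for s in (2.0, 1.5)]
```

The hitting time here is a random variable; averaging over many seeds would expose its exponential dependence on the barrier-height-to-noise ratio.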
Primary Area: Optimization->Stochastic
Keywords: stochastic gradient descent, non-convex, large deviations
Submission Number: 6146