Keywords: non-convex, stopping times, statistics, gradient descent, early stopping
TL;DR: How to bound the expected number of iterations before gradient descent finds a stationary point
Abstract: This work examines the convergence of stochastic gradient algorithms that use early stopping based on a validation function, wherein optimization ends when the magnitude of a validation function gradient drops below a threshold. We derive conditions that guarantee this stopping rule is well-defined and analyze the expected number of iterations and gradient evaluations needed to meet this criterion. The guarantee accounts for the distance between the training and validation sets, measured with the Wasserstein distance. We develop the approach for stochastic gradient descent (SGD), allowing for biased update directions subject to a Lyapunov condition. We apply the approach to obtain new bounds on the expected running time of several algorithms, including decentralized SGD (DSGD), a variant of decentralized SGD known as \textit{Stacked SGD}, and the stochastic variance reduced gradient (SVRG) algorithm. Finally, we consider the generalization properties of the iterate returned by early stopping.
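To make the stopping rule concrete, here is a minimal sketch (not the paper's exact procedure) of SGD with early stopping triggered when the validation-gradient norm falls below a threshold; the toy objectives, function names, and parameters (`train_grad`, `val_grad`, `eps`, `step`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares losses standing in for the training and validation objectives.
A_tr, b_tr = rng.normal(size=(200, 5)), rng.normal(size=200)
A_va, b_va = rng.normal(size=(100, 5)), rng.normal(size=100)

def train_grad(x, batch=32):
    # Stochastic mini-batch gradient of the training loss.
    idx = rng.integers(0, A_tr.shape[0], size=batch)
    A, b = A_tr[idx], b_tr[idx]
    return A.T @ (A @ x - b) / batch

def val_grad(x):
    # Full gradient of the validation loss.
    return A_va.T @ (A_va @ x - b_va) / A_va.shape[0]

def sgd_early_stop(x0, step=1e-3, eps=1e-2, max_iter=100_000):
    x = x0.copy()
    for t in range(max_iter):
        # Stopping rule: halt once the validation-gradient magnitude is below eps.
        if np.linalg.norm(val_grad(x)) <= eps:
            return x, t  # t is the (random) number of iterations before stopping
        x -= step * train_grad(x)
    return x, max_iter

x_hat, n_iters = sgd_early_stop(np.zeros(5))
print(f"stopped after {n_iters} iterations")
```

The quantity analyzed in the paper corresponds to the expectation of `n_iters` (and of the associated gradient evaluations) under such a stopping rule.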