Keywords: Learning Rate Schedules, Convergence Analysis, Stochastic Gradient Descent, Online Learning
TL;DR: Extending the concept of the edge of stability to the stochastic gradient descent setting.
Abstract: The trade-off inherent in constant-learning-rate stochastic gradient descent (SGD) is well documented empirically: larger learning rates often yield faster convergence but risk divergence. However, the question of an appropriate choice of learning rate has rarely received systematic treatment; learning-rate schedules are often chosen based on domain knowledge and preliminary numerical experiments, without theoretical guidance. This question is intimately related to the concept of the "edge of stability", a regime in which the chain neither converges nor explodes. Despite a rich literature on deterministic gradient descent, a rigorous characterization of the edge of stability for the more ubiquitous SGD chains remains an open question. In this paper, we formalize the notion of the stability region and develop theoretical guarantees for estimating it for SGD over a wide class of strongly convex objectives. We introduce a stochastic version of the Lyapunov exponent for SGD, which yields a practical, data-driven threshold for admissible learning rates. All of our theoretical results are backed by extensive experiments. Collectively, these findings provide a practically implementable and theoretically grounded way of choosing learning-rate parameters across a range of problems, while paving the way toward generalization to more complicated nonconvex landscapes.
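The stochastic-Lyapunov-exponent threshold described above can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's estimator: for SGD on a 1-D least-squares objective, each step applies a random affine map whose derivative is 1 - eta * a_i^2, so the empirical average of log|1 - eta * a_i^2| plays the role of a stochastic Lyapunov exponent, and learning rates with a negative exponent are admissible.

```python
import numpy as np

# Toy setting (assumed for illustration): SGD on the 1-D least-squares
# objective f(x) = (1/2n) * sum_i (a_i * x - b_i)^2. One SGD step with a
# randomly sampled index i is x -> x - eta * a_i * (a_i * x - b_i), a random
# affine map with derivative (1 - eta * a_i^2). A stochastic Lyapunov
# exponent can be estimated as the mean of log|1 - eta * a_i^2|; the chain
# stays stable (does not explode) when this exponent is negative.

rng = np.random.default_rng(0)
a = rng.normal(size=1000)  # synthetic feature values a_i

def lyapunov_exponent(eta, a, n_steps=100_000, rng=rng):
    """Monte Carlo estimate of E[log|1 - eta * a_i^2|] over random indices i."""
    idx = rng.integers(0, len(a), size=n_steps)
    return np.mean(np.log(np.abs(1.0 - eta * a[idx] ** 2)))

# Scan learning rates; the largest eta in the scan with a negative exponent
# serves as a data-driven threshold for admissible (stable) learning rates.
etas = np.linspace(0.01, 3.0, 60)
exps = np.array([lyapunov_exponent(eta, a) for eta in etas])
stable = etas[exps < 0]
print(f"largest stable learning rate in scan: {stable.max():.3f}")
```

Small learning rates give an exponent near -eta * E[a_i^2] < 0 (stable), while very large ones make log|1 - eta * a_i^2| positive on average (explosive), so the sign change marks the edge of stability in this toy model.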
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 21464