Abstract: We present and compare learning rate schedules for stochastic gradient descent, a general algorithm which includes LMS, on-line backpropagation and k-means clustering as special cases. We introduce "search-then-converge" type schedules which outperform the classical constant and "running average" ($1/t$) schedules both in speed of convergence and quality of solution.

1 Introduction: Stochastic Gradient Descent

The optimization task is to find a parameter vector $W$ which minimizes a function $G(W)$. In the context of learning systems, typically $G(W) = \mathcal{E}_X\, E(W, X)$, i.e. $G$ is the average of an objective function over the exemplars, labeled $E$ and $X$ respectively. The stochastic gradient descent algorithm is
$$\Delta W(t) = -\eta(t)\, \nabla_W E(W(t), X(t)),$$
where $t$ is the "time" and $X(t)$ is the most recent independently-chosen random exemplar. For comparison, the deterministic gradient descent algorithm is
$$\Delta W(t) = -\eta(t)\, \nabla_W \mathcal{E}_X\, E(W(t), X).$$
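To make the update rule and the role of the schedule concrete, the following is a minimal sketch of stochastic gradient descent with interchangeable learning rate schedules. The constant and $1/t$ schedules follow the text; the particular search-then-converge form $\eta(t) = \eta_0 / (1 + t/\tau)$, the hyperparameter names (`eta0`, `tau`), and the toy quadratic objective are illustrative assumptions, not definitions taken from this section.

```python
import numpy as np

# Learning rate schedules (eta0 and tau are illustrative hyperparameters).
def eta_constant(t, eta0=0.1):
    return eta0

def eta_running_average(t, eta0=0.1):
    # Classical "running average" 1/t schedule.
    return eta0 / (t + 1)

def eta_search_then_converge(t, eta0=0.1, tau=100.0):
    # Assumed search-then-converge form: roughly constant for t << tau,
    # decaying like 1/t for t >> tau.
    return eta0 / (1.0 + t / tau)

def sgd(grad_E, X, w0, schedule, steps=1000, rng=None):
    """Stochastic gradient descent: Delta W(t) = -eta(t) * grad_W E(W(t), X(t)).

    grad_E(w, x) returns the gradient of the per-exemplar objective E(w, x);
    X is the pool of exemplars from which X(t) is drawn independently at random.
    """
    rng = rng or np.random.default_rng(0)
    w = np.asarray(w0, dtype=float)
    for t in range(steps):
        x = X[rng.integers(len(X))]          # independently-chosen random exemplar X(t)
        w = w - schedule(t) * grad_E(w, x)   # stochastic gradient step
    return w

# Toy example: E(w, x) = 0.5 * ||w - x||^2, so G(W) is minimized at the mean of X.
if __name__ == "__main__":
    X = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=(500, 2))
    grad_E = lambda w, x: w - x
    for name, sched in [("constant", eta_constant),
                        ("1/t", eta_running_average),
                        ("search-then-converge", eta_search_then_converge)]:
        w_final = sgd(grad_E, X, w0=np.zeros(2), schedule=sched)
        print(name, w_final)
```

On this toy quadratic, a constant $\eta$ leaves the iterate hovering around the minimum with residual noise, while the decaying schedules settle down; the sketch is meant only to illustrate the update rule and schedule interface, not to reproduce the paper's experiments.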