Keywords: Stochastic Optimization, Hyperparameter Tuning, Learning Rate Schedules, Convex Optimization, Polyak Stepsize
TL;DR: We give new convergence theory for the schedule-free method under different schedules, show that this theory is predictive, and demonstrate that the new adaptive schedules are competitive.
Abstract: The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. To further enhance its performance, we propose two sets of theoretically inspired hyperparameters. First, we develop a last-iterate convergence theory for arbitrary schedules in the convex and Lipschitz setting, which naturally suggests the choice of the momentum parameters. When applied to the warmup-stable-decay (wsd) schedule, our theory yields the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. Second, we develop a Polyak stepsize for the schedule-free method and provide an anytime convergence theorem for the convex and Lipschitz setting. Our theoretically inspired hyperparameters achieve state-of-the-art performance across a range of image and language learning tasks.
Primary Area: optimization
Submission Number: 1172
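To make the abstract's setup concrete, below is a minimal illustrative sketch of schedule-free SGD (in the style of the original schedule-free update) combined with a generic warmup-stable-decay schedule and a classical Polyak stepsize. The schedule shape, the averaging weights, and all constants are placeholder assumptions for illustration, not the specific momentum parameters or stepsize rule proposed in this submission.

```python
# Illustrative sketch only: standard schedule-free SGD combined with
# (a) a generic warmup-stable-decay (wsd) learning-rate schedule and
# (b) a classical Polyak stepsize. The particular hyperparameter choices
# analyzed in the submission are NOT reproduced here.
import numpy as np

def wsd_schedule(t, total_steps, peak_lr=0.1, warmup_frac=0.1, decay_frac=0.2):
    """Generic wsd schedule: linear warmup, constant plateau, linear decay to 0."""
    warmup = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if t < warmup:
        return peak_lr * (t + 1) / max(warmup, 1)
    if t < decay_start:
        return peak_lr
    return peak_lr * (total_steps - t) / max(total_steps - decay_start, 1)

def polyak_stepsize(f_val, grad, f_star=0.0, eps=1e-12):
    """Classical Polyak stepsize (f(y_t) - f*) / ||g_t||^2, assuming f* is known."""
    return max(f_val - f_star, 0.0) / (np.dot(grad, grad) + eps)

def schedule_free_sgd(grad_fn, x0, total_steps, stepsize_fn, beta=0.9):
    """Schedule-free SGD: gradients at interpolation point y_t, base iterate z_t,
    averaged iterate x_t (returned as the output)."""
    z = x0.copy()
    x = x0.copy()
    for t in range(total_steps):
        y = (1.0 - beta) * z + beta * x      # interpolation point where gradients are taken
        f_val, g = grad_fn(y)
        gamma = stepsize_fn(t, y, f_val, g)
        z = z - gamma * g                    # base (SGD) iterate
        c = 1.0 / (t + 2)                    # uniform-averaging weight (placeholder choice)
        x = (1.0 - c) * x + c * z            # running average of the z iterates
    return x

# Toy usage on a convex quadratic f(y) = 0.5 * ||y||^2 (so f* = 0).
quad = lambda y: (0.5 * np.dot(y, y), y)
x0 = np.ones(5)
x_wsd = schedule_free_sgd(quad, x0, 1000, lambda t, y, f, g: wsd_schedule(t, 1000))
x_polyak = schedule_free_sgd(quad, x0, 1000, lambda t, y, f, g: polyak_stepsize(f, g))
print(x_wsd, x_polyak)
```

The sketch separates the stepsize rule from the schedule-free averaging, so a fixed schedule (wsd) and an adaptive rule (Polyak) can be swapped in through the same `stepsize_fn` hook, which is hypothetical naming introduced here for illustration.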