When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: learning rates; linear decay; deep learning; online learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In this paper, we present a refined study of learning rate schedules for stochastic gradient descent (SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is the one typically used in practice. Furthermore, we break away from the tradition of replacing the gradients with crude upper bounds, which allows us to obtain a \emph{problem-adaptive} learning rate schedule. Our method is the first systematic approach to \emph{automatically} yield learning rate warm-up and rapid learning rate annealing near the end of training. In cases where gradient norm information is not available, our theory predicts that the best choice is the linear-decay schedule, which sets the step size proportional to $1 - t/T$, where $t$ is the current iteration and $T$ is the total number of steps. Our final theoretical result is an extension of our methodology to coordinate-wise methods. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that, overall, the linear-decay schedule outperforms all commonly used default schedules, including cosine annealing, and that our schedule refinement method gives further improvements.
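As a concrete point of reference, the snippet below is a minimal sketch (not taken from the paper) of the linear-decay schedule mentioned in the abstract, where the step size at iteration $t$ is proportional to $1 - t/T$. The names `base_lr`, `t`, and `T` are illustrative; the paper's gradient-based refinement and warm-up components are not shown.

```python
def linear_decay_lr(base_lr: float, t: int, T: int) -> float:
    """Linear-decay schedule: step size proportional to 1 - t/T.

    base_lr: peak learning rate (illustrative value, not from the paper)
    t:       current iteration, 0-indexed
    T:       total number of iterations
    """
    return base_lr * (1.0 - t / T)


if __name__ == "__main__":
    base_lr, T = 0.1, 10  # hypothetical values for illustration
    for t in range(T):
        print(f"step {t}: lr = {linear_decay_lr(base_lr, t, T):.4f}")
```

In a PyTorch training loop, the same decay factor could be supplied to `torch.optim.lr_scheduler.LambdaLR` as `lambda t: 1 - t / T`, applied on top of the optimizer's base learning rate.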
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 239