The Road Less Scheduled

Aaron Defazio; Xingyu Alice Yang; Ahmed Khaled; Konstantin Mishchenko; Harsh Mehta; Ashok Cutkosky

The Road Less Scheduled

Aaron Defazio, Xingyu Alice Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, Ashok Cutkosky

Published: 25 Sept 2024, Last Modified: 14 Jan 2025NeurIPS 2024 oralEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Stochastic Optimization, Optimization, Convex Optimization, Learning Rates, Learning Rate Schedules

TL;DR: Train without learning rate schedules

Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step $T$ are greatly out-performed by learning rate schedules that depend on $T$. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.

Primary Area: Optimization for deep networks

Submission Number: 14

Loading