Keywords: Learning rate schedules, Large language models (LLMs), AdamW optimizer, Weight decay, Compute-optimal training
TL;DR: We perform a large-scale empirical study to establish that linear decay-to-zero is the optimal learning rate schedule for LLMs across a range of settings; some novel theoretical analysis helps explain why.
Abstract: LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal max LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. Benefits increase further with more training tokens; e.g., a 617M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10x decay, corresponding to an astonishing 60% FLOPs savings. This implies models like Llama2-7B, trained for 286 TPP with 10x decay, were severely under-decayed. We demonstrate the benefits of D2Z across a range of model sizes, batch sizes, and other training configurations. We explain the success of linear D2Z via a novel interpretation of AdamW as a convex combination of weight updates, with coefficients governed by the LR schedule. This interpretation demonstrates how linear D2Z balances the demands of early training (moving away quickly from initial conditions) and late training (smoothing over more updates to mitigate gradient noise).
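The convex-combination reading of AdamW mentioned in the abstract can be illustrated with a small sketch. It assumes the standard decoupled update theta_t = (1 - lr_t * wd) * theta_{t-1} - lr_t * u_t, where u_t is the Adam step direction. Writing alpha_t = lr_t * wd and x_t = -u_t / wd gives theta_t = (1 - alpha_t) * theta_{t-1} + alpha_t * x_t, so the final weights are a weighted average of the initial weights and the x_t, with weights set entirely by the LR schedule. The function names, argument names, and numeric values below are illustrative assumptions, not taken from the paper.

```python
import math


def linear_d2z(step, total, peak, warmup):
    """Linear warmup to `peak`, then linear decay that reaches zero at `total`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    return peak * (total - step) / (total - warmup)


def cosine_10x(step, total, peak, warmup):
    """Linear warmup to `peak`, then cosine decay to 10% of `peak` at `total`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    frac = (step - warmup) / (total - warmup)
    floor = 0.1 * peak
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * frac))


def combination_coefficients(lrs, weight_decay):
    """Return [c_0, c_1, ..., c_T] for the unrolled AdamW recursion.

    c_0 weights the initial parameters and c_t (t >= 1) weights the
    step-t term x_t, with c_t = alpha_t * prod_{s > t} (1 - alpha_s)
    and alpha_t = lrs[t - 1] * weight_decay. The coefficients sum to 1.
    """
    alphas = [lr * weight_decay for lr in lrs]
    coeffs = []
    tail = 1.0  # running product of (1 - alpha_s) over later steps s > t
    for alpha in reversed(alphas):
        coeffs.append(alpha * tail)
        tail *= 1.0 - alpha
    coeffs.append(tail)  # c_0: what remains of the initial parameters
    coeffs.reverse()
    return coeffs


if __name__ == "__main__":
    # Hypothetical settings chosen only for illustration.
    total, warmup, peak, wd = 1000, 100, 3e-3, 0.1
    for name, schedule in [("linear D2Z", linear_d2z), ("cosine 10x", cosine_10x)]:
        lrs = [schedule(s, total, peak, warmup) for s in range(total)]
        c = combination_coefficients(lrs, wd)
        print(f"{name}: c_0={c[0]:.4f}, last-step weight={c[-1]:.2e}, sum={sum(c):.6f}")
```

Under these assumptions, comparing the two coefficient profiles shows how a schedule that decays to zero gives the final steps vanishing weight, which is one way to picture the "smoothing over more updates" described above. It is a sketch of the algebra only and does not reproduce the paper's experiments.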
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Submission Number: 7729