Scaling Law with Learning Rate Annealing

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Scaling Laws, Full Loss Curve Prediction, Learning Rate Schedule, LLM Pretraining
TL;DR: A new scaling law formula with learning rate annealing that can fit and predict full loss curves.
Abstract: We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: $$L(s) = L_0 + A\cdot S_1^{-\alpha} - C\cdot S_2,$$ where $L(s)$ is the validation loss at step $s$, $S_1$ is the area under the LR curve, $S_2$ is the LR annealing area, and $L_0$, $A$, $C$, $\alpha$ are constant parameters. This formulation accounts for two main effects: (1) power-law scaling over data size, and (2) the additional loss reduction during LR annealing. Unlike previous studies that fit only the losses at final steps, our formulation captures the entire training curve, allowing parameters to be fitted using losses from any training step. By applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step under any learning rate scheduler (LRS). This approach significantly reduces the computational cost of formulating scaling laws while providing greater accuracy and expressiveness. Extensive experiments demonstrate that our findings hold across a range of hyper-parameters and model architectures, and extend to the scaling effect of model size. Moreover, our formulation provides accurate theoretical insights into empirical results observed in numerous previous studies, particularly those focusing on LR schedules and annealing. We believe this work is promising for enhancing the understanding of LLM training dynamics while democratizing scaling laws, and that it can help both research and industrial practitioners refine training strategies for future LLMs.
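As an illustration of how the proposed form could be evaluated and fitted, a minimal sketch is given below. This is not the authors' code: the warmup-plus-cosine schedule, the synthetic "observed" losses, the constants, and the simple readings of $S_1$ as the cumulative LR area and $S_2$ as the accumulated LR decrease are assumptions made for illustration only; the paper's exact definitions and fitting procedure may differ.

```python
# Minimal sketch (assumed, not the authors' code) of evaluating and fitting
# L(s) = L0 + A * S1^{-alpha} - C * S2 under an arbitrary LR schedule.
import numpy as np
from scipy.optimize import curve_fit

def cosine_lr(total_steps, peak_lr=3e-4, min_lr=3e-5, warmup=500):
    """Hypothetical warmup + cosine-decay schedule, used only as an example LRS."""
    s = np.arange(1, total_steps + 1)
    warm = s / warmup * peak_lr
    frac = np.clip((s - warmup) / max(total_steps - warmup, 1), 0.0, 1.0)
    decay = min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + np.cos(np.pi * frac))
    return np.where(s <= warmup, warm, decay)

def lr_areas(lrs):
    """S1: area under the LR curve; S2: accumulated LR annealing (decrease) area.
    Both are simple readings of the abstract's description, assumed for illustration."""
    S1 = np.cumsum(lrs)
    drops = np.maximum(-np.diff(lrs, prepend=lrs[0]), 0.0)  # eta_{i-1} - eta_i when LR decreases
    S2 = np.cumsum(drops)
    return S1, S2

def loss_curve(lrs, L0, A, C, alpha):
    """Evaluate L(s) = L0 + A * S1^{-alpha} - C * S2 at every training step s."""
    S1, S2 = lr_areas(lrs)
    return L0 + A * S1 ** (-alpha) - C * S2

# Fit (L0, A, C, alpha) on one loss curve, then reuse the constants to predict
# the loss under a different schedule. Here the "observed" curve is synthetic;
# in practice it would come from a real training run.
train_lrs = cosine_lr(10_000)
observed_losses = loss_curve(train_lrs, L0=2.0, A=0.5, C=200.0, alpha=0.5)  # stand-in data

popt, _ = curve_fit(
    lambda steps, L0, A, C, alpha: loss_curve(train_lrs, L0, A, C, alpha)[steps.astype(int) - 1],
    np.arange(1, 10_001), observed_losses, p0=[2.0, 1.0, 100.0, 0.5],
)
predicted = loss_curve(cosine_lr(10_000, peak_lr=1e-4), *popt)  # prediction under another LRS
```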
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6174