Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Published: 21 Jun 2024, Last Modified: 26 Jul 2024, ES-FoMo-II 2024 Poster, CC BY 4.0
Keywords: Scaling Laws, Large Language Models, Learning Rate Schedules, Weight Averaging
TL;DR: We show reliable scaling behavior of an alternative LR schedule for LLM training, thereby making scaling law experiments more accessible.
Abstract: Scale has become a crucial factor for obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to designing new neural architectures and training schemes effectively. In this work, we argue that scaling and training research has been needlessly complicated by the reliance on the cosine learning rate schedule, which requires a separate run for each training duration of interest. We investigate a direct alternative -- a constant learning rate with cooldowns -- that allows reusing compute between runs of different lengths. We analyze different recipes for this schedule and find that it matches or outperforms cosine, while scaling just as predictably and reliably. Additionally, we show that stochastic weight averaging yields strong performance improvements along the training trajectory, without additional training costs, across different scales. Importantly, with these findings, we demonstrate that scaling experiments can be performed with significantly fewer GPU hours and FLOPs. Our code is available at https://github.com/epfml/schedules-and-scaling/.
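As a rough illustration of the schedule described in the abstract, below is a minimal sketch of a constant learning rate with a final cooldown. It assumes an optional linear warmup and a linear decay to zero over the last fraction of training; the function name, arguments, and cooldown shape are illustrative choices, not necessarily the paper's exact recipe.

```python
def constant_with_cooldown(step, total_steps, base_lr,
                           warmup_steps=0, cooldown_frac=0.2):
    """Learning rate at `step` for a run of `total_steps` steps (illustrative sketch)."""
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # optional linear warmup
        return base_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # constant phase: identical regardless of the eventual run length,
        # so its checkpoints can be shared across training durations
        return base_lr
    # linear cooldown to zero over the final `cooldown_frac` of training
    progress = (step - cooldown_start) / max(1, cooldown_steps)
    return base_lr * (1.0 - progress)
```

Because the constant phase does not depend on the final training duration, a single long run can serve several durations: to obtain a model at a given step count, one branches off a checkpoint from the constant phase and runs only the short cooldown, rather than launching a fresh cosine run for each duration of interest.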
Submission Number: 79