How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

ICLR 2026 Conference Submission 13070 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM pretraining, Curriculum Learning, Model Weight Averaging
TL;DR: Use model weight averaging to enhance curriculum learning in LLM pretraining.
Abstract: Curriculum learning is a powerful paradigm, yet its application to large language model (LLM) pretraining remains underexplored, especially in scenarios where high-quality data is limited yet crucial. Previous work has primarily applied curriculum learning to LLM pretraining by searching for better data quality metrics. However, these approaches have yielded only marginal gains, and curriculum-based training is still not standard practice. In this work, we explore the problem from the opposite perspective: given a good quality metric, can current curriculum learning strategies produce better results? We diagnose a key yet overlooked factor responsible for this deficiency: the interplay between the data order and the learning rate (LR) schedule. We find that while curriculum learning can greatly outperform pretraining on a uniform data distribution under a constant LR schedule, this advantage diminishes as the learning rate decays. Building on this observation, we propose replacing LR decay with model averaging, which computes a weighted average of the last several model checkpoints. We find that this strategy achieves better results than standard LR decay schedules, especially in a mid-training regime where only a limited portion of high-quality data is available. Furthermore, this analysis reveals that model averaging is substantially strengthened by the presence of curriculum learning. Finally, we propose a co-designed strategy for curriculum-based LLM pretraining: combining a moderate LR decay with model averaging. This approach allows the model to strike a balance between learning effectively from high-quality data, reducing knowledge forgetting, and mitigating gradient noise. The combination highlights a previously overlooked opportunity to improve pretraining by co-designing the data curriculum, the LR schedule, and model averaging.
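
The core operation the abstract describes, replacing LR decay with a weighted average of the last several model checkpoints, can be sketched as follows. This is a minimal illustration under stated assumptions, not the submission's implementation: the function name average_checkpoints, the checkpoint file names, the example weights, and the assumption that each checkpoint is saved as a plain PyTorch state dict are all illustrative.

import torch

def average_checkpoints(ckpt_paths, weights=None):
    # Return a state dict that is the weighted average of the given checkpoints.
    # Assumes each file holds a plain state dict (parameter name -> tensor).
    if weights is None:
        # Default to a uniform average over the last len(ckpt_paths) checkpoints.
        weights = [1.0 / len(ckpt_paths)] * len(ckpt_paths)
    assert len(weights) == len(ckpt_paths)

    avg_state = None
    for path, w in zip(ckpt_paths, weights):
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += w * v.float()
    return avg_state

# Hypothetical usage: average the last three checkpoints, weighting later ones more heavily,
# then load the merged weights for evaluation.
# merged = average_checkpoints(["step_8000.pt", "step_9000.pt", "step_10000.pt"],
#                              weights=[0.2, 0.3, 0.5])
# model.load_state_dict(merged)
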
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13070