On the Limits of Curriculum Learning for Post-Training Large Language Models

ICLR 2026 Conference Submission 18373 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Curriculum Learning, SFT, RL, post-training, finetuning, synthetic datasets, mathgap, kk, data contamination
TL;DR: No single curriculum strategy yields consistent performance gains in contamination-free synthetic data settings; standard random sampling remains highly competitive.
Abstract: Large language models (LLMs) excel at many common-sense tasks, yet they remain brittle when required to perform consistent multi-step reasoning. Evaluations on benchmarks such as AMC or AIME25 are often affected by data contamination, motivating our focus on synthetic reasoning tasks with controllable difficulty. Synthetic datasets allow us to generate problems whose difficulty corresponds directly to the number of (verbalized) reasoning steps required. By focusing on synthetic tasks with minimal natural language complexity, we ensure that our conclusions are driven by reasoning ability rather than sophisticated linguistic understanding. We investigate generalization to higher difficulty levels at the granularity of individual difficulty levels, a setting that differs from standard out-of-distribution evaluation, which typically tests on entirely different tasks. To improve generalization to harder problems, we study curriculum learning (CL) as a mechanism to exploit difficulty during post-training. Across multiple synthetic reasoning tasks and a family of medium-sized models, we find that CL has no significant impact under either supervised fine-tuning (SFT) or reinforcement learning (RL). Moreover, the optimal CL schedule varies across datasets and models, while standard random sampling performs competitively. We identify response length as a key factor driving model performance, and observe that CL schedules do not significantly affect response length, explaining why SFT performance does not improve with CL. While pre-training commonly adopts data mixing strategies akin to curriculum learning, these findings call into question the usefulness of curriculum learning for post-training on mathematical reasoning tasks, and suggest that future work should explore alternative mechanisms for strengthening pure reasoning robustness in LLMs.
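The abstract contrasts difficulty-ordered curricula with the random-sampling baseline during post-training. The sketch below is a hypothetical illustration (not the authors' code) of the two training-order strategies, assuming each synthetic problem carries a difficulty label equal to its number of reasoning steps; the `Problem` class and function names are made up for this example.

```python
# Minimal sketch: easy-to-hard curriculum ordering vs. random sampling
# of synthetic reasoning problems labeled by difficulty (reasoning steps).
import random
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    solution: str
    difficulty: int  # assumed: number of (verbalized) reasoning steps


def curriculum_order(problems: list[Problem]) -> list[Problem]:
    """Easy-to-hard curriculum: group by difficulty, shuffle within each level."""
    by_level: dict[int, list[Problem]] = {}
    for p in problems:
        by_level.setdefault(p.difficulty, []).append(p)
    ordered: list[Problem] = []
    for level in sorted(by_level):
        random.shuffle(by_level[level])
        ordered.extend(by_level[level])
    return ordered


def random_order(problems: list[Problem]) -> list[Problem]:
    """Baseline: uniform random ordering of the training data."""
    shuffled = list(problems)
    random.shuffle(shuffled)
    return shuffled
```

Per the paper's findings, the curriculum ordering produced by `curriculum_order` did not significantly outperform the `random_order` baseline under either SFT or RL.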
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18373