CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

17 Sept 2025 (modified: 26 Jan 2026) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: Large Language Models; LLM Reasoning; Curriculum Learning
Abstract: Recently, online Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically treat all training samples uniformly, overlooking the vast differences in problem difficulty relative to the model's current capabilities. This uniform training strategy leads to inefficient exploration of problems the model has already mastered while providing little effective guidance on the problems that challenge its abilities the most, limiting both learning efficiency and the performance upper bound. To address this, we propose CLPO (Curriculum-guided Learning for Policy Optimization), a novel algorithm that creates a dynamic pedagogical feedback loop within the policy optimization process. The core of CLPO is to leverage the model's own rollout performance to conduct real-time difficulty assessment, thereby constructing an Online Curriculum. This curriculum then guides an Adaptive Problem Restructuring mechanism, in which the model acts as its own teacher: it diversifies medium-difficulty problems to promote generalization and simplifies hard problems to make them more accessible. Our approach transforms the static training procedure into a dynamic process that co-evolves with the model's capabilities. Experiments show that CLPO achieves state-of-the-art (SOTA) performance across eight challenging mathematical and general reasoning benchmarks, with an average pass@1 improvement of 6.96% over other methods, demonstrating its potential for training more capable reasoning models more efficiently.
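The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.3/0.7 accuracy thresholds, and the "keep" action for easy problems are all assumptions made for illustration, since the abstract only states that difficulty is assessed from rollout performance, that medium-difficulty problems are diversified, and that hard problems are simplified.

```python
from typing import Dict, List


def rollout_accuracy(rewards: List[int]) -> float:
    """Fraction of sampled rollouts that received a verifiable reward of 1."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def assign_difficulty(acc: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map rollout accuracy to a difficulty label.

    The thresholds `low` and `high` are hypothetical; the abstract does not
    specify how the difficulty boundaries are chosen.
    """
    if acc >= high:
        return "easy"
    if acc <= low:
        return "hard"
    return "medium"


def build_curriculum(rollout_rewards: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Group problems by difficulty using the model's own rollout results."""
    buckets: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
    for problem, rewards in rollout_rewards.items():
        label = assign_difficulty(rollout_accuracy(rewards))
        buckets[label].append(problem)
    return buckets


def plan_restructuring(buckets: Dict[str, List[str]]) -> Dict[str, str]:
    """Choose a restructuring action per problem.

    Medium problems are diversified and hard problems are simplified, as the
    abstract states. Leaving easy problems unchanged ("keep") is an assumption.
    """
    actions = {"easy": "keep", "medium": "diversify", "hard": "simplify"}
    return {p: actions[label] for label, problems in buckets.items() for p in problems}


# Toy rollout results: 1 = verified correct, 0 = incorrect.
rollouts = {
    "p1": [1, 1, 1, 1],  # always solved
    "p2": [1, 0, 1, 0],  # solved half the time
    "p3": [0, 0, 0, 0],  # never solved
}
curriculum = build_curriculum(rollouts)
plan = plan_restructuring(curriculum)
```

In a real training loop, the grouping would presumably be recomputed from fresh rollouts at each step, so that the curriculum tracks the model's current capabilities; how the rewritten problems enter the policy update is not described in the abstract and is left out here.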
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9312