Online Difficulty Filtering for Reasoning-Oriented Reinforcement Learning

ACL ARR 2025 July Submission 1215 Authors

29 Jul 2025 (modified: 19 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, because rewards in RORL are sparse, effective training depends heavily on selecting problems of appropriate difficulty. Curriculum learning attempts to address this by adjusting difficulty, but it typically relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we show theoretically and empirically that curating each batch on the fly with problems on which the training model achieves intermediate accuracy maximizes the effectiveness of RORL training; we call this balanced online difficulty filtering. We first derive that a lower bound on the KL divergence between the initial and the optimal policy can be expressed in terms of the variance of the sampled accuracy. Building on this insight, we show that balanced filtering maximizes this lower bound, leading to better performance. Experiments on five challenging math reasoning benchmarks with 3B- and 7B-scale models show that balanced online filtering yields additional gains of 10\% on AIME and 13\% on AMC while remaining scalable. Further analysis shows gains in sample and training-time efficiency, surpassing plain GRPO with only 60\% of the training time and training set volume.
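To illustrate the idea described in the abstract, the following is a minimal sketch of balanced online difficulty filtering. All function and parameter names (`rollout_accuracy`, `lo`, `hi`, `target_size`) are hypothetical placeholders, and the accuracy thresholds are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch: keep only prompts on which the current policy achieves
# intermediate accuracy, so sampled-accuracy variance (and thus the learning
# signal) stays non-trivial. Names and thresholds are assumptions.
from typing import Callable, List, Sequence


def filter_batch(
    prompts: Sequence[str],
    rollout_accuracy: Callable[[str], float],  # fraction of correct rollouts per prompt
    target_size: int,
    lo: float = 0.2,  # assumed lower accuracy bound (not from the paper)
    hi: float = 0.8,  # assumed upper accuracy bound (not from the paper)
) -> List[str]:
    """Return up to `target_size` prompts whose sampled accuracy is
    strictly intermediate, i.e. neither always solved nor never solved."""
    kept: List[str] = []
    for prompt in prompts:
        acc = rollout_accuracy(prompt)  # e.g. mean reward over a group of sampled completions
        if lo <= acc <= hi:             # intermediate difficulty -> informative for RORL updates
            kept.append(prompt)
        if len(kept) == target_size:
            break
    return kept
```

In practice the accuracy estimate would come from the same group of rollouts used by the policy-gradient update (e.g., the per-prompt group in GRPO), so filtering adds little extra sampling cost.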
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: reasoning, reinforcement learning, curriculum learning
Languages Studied: English
Submission Number: 1215