Rule-Based Reference Updates after R1-Based Post-Training Reinforcement Learning for Small Reasoning Language Models

17 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: reinforcement learning, group policy, adaptive learning
Abstract: Inference-time scaling improves LLM reasoning, with reinforcement learning as a key driver. Post-training reinforcement learning and its curriculum-learning variants offer significant benefits in enhancing the reasoning ability of large language models; we designate this process as Phase 1. We then propose Phase 2: rule-based reference model updates in reinforcement learning after Phase 1, which explores the potential of updating the reference model after R1-like reinforcement learning. In detail, we introduce a rule-based reference-update reinforcement learning approach that continues to enhance the reasoning capabilities of small reasoning language models beyond current classical post-training reinforcement learning. In particular, a $1.5B$-parameter LLM achieves $60.2\%$ on AIME24, $48.2\%$ on AIME25, and $91.5\%$ on Math500, with $1.5\%$ to $4\%$ score improvements on AMC, Minerva, and OlympiadBench. These results, enabled by the proposed rule-based reference-update reinforcement learning algorithm, demonstrate math reasoning capabilities comparable to o1-mini/o3-mini, achievable within a typical school laboratory setting. In addition, we open-source both the dataset and model checkpoints to support future research in large-scale reinforcement learning for LLMs.
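The abstract does not specify the update rule itself, so the following is only a minimal sketch of the Phase 2 idea under stated assumptions: run GRPO-style steps with a KL penalty to a frozen reference model (Phase 1), and refresh that reference from the current policy when a simple rule fires. The function names (`grpo_step`, `maybe_update_reference`) and the moving-average reward-threshold trigger are illustrative assumptions, not the authors' actual method.

```python
import torch

def grpo_step(policy_logps, ref_logps, rewards, beta=0.04):
    # Group-relative advantage: normalize each sampled answer's reward
    # against the group mean, as in GRPO-style post-training RL (Phase 1).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # k3 KL estimator between policy and frozen reference:
    # exp(r) - r - 1, with r = log pi_ref - log pi_theta.
    log_ratio = ref_logps - policy_logps
    kl = log_ratio.exp() - log_ratio - 1
    # Policy-gradient term plus a KL penalty anchoring to the reference.
    return -(adv * policy_logps).mean() + beta * kl.mean()

def maybe_update_reference(policy, ref, reward_history,
                           window=64, threshold=0.7):
    # Rule-based reference update (Phase 2 idea; the rule here is assumed):
    # once the moving-average reward clears a threshold, copy the current
    # policy weights into the reference so the KL anchor tracks progress.
    if len(reward_history) >= window:
        if sum(reward_history[-window:]) / window > threshold:
            ref.load_state_dict(policy.state_dict())
            reward_history.clear()  # restart the window after an update
            return True
    return False
```

In a training loop, `grpo_step` would be computed per sampled group and `maybe_update_reference` checked periodically; with a trigger that never fires, this reduces to standard fixed-reference GRPO, i.e., the Phase 1 baseline.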
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9397