Keywords: reinforcement learning, large reasoning model, LLM reasoning
Abstract: Recent advances in large reasoning models (LRMs) such as OpenAI's o1 and DeepSeek-R1 have demonstrated that reinforcement learning (RL) with outcome-based supervision can significantly enhance the reasoning abilities of language models. However, these improvements have so far relied on massive model scales and compute budgets, leaving open the question of whether RL-based scaling can be made both effective and efficient at smaller scales. In this work, we introduce DeepScaleR-1.5B, a 1.5B parameter model trained with reinforcement learning using a novel iterative context-lengthening strategy. Our method begins with shorter context windows and progressively extends them throughout training, enabling the model to first learn to reason efficiently before learning to reason longer. This approach yields substantial performance gains at dramatically reduced computational cost. DeepScaleR-1.5B achieves 43.3% Pass@1 on the AIME 2024 math benchmark, a 14.3 percentage point improvement over its base model and on par with OpenAI's o1-preview, while requiring only a fraction of the compute. We provide a full training recipe, including dataset, code, hyperparameters, and training methodology, demonstrating that small models can be effectively scaled into strong math reasoners via RL.
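The iterative context-lengthening strategy described above can be pictured as a staged training schedule in which the response-length budget grows over the course of RL training. The sketch below is a minimal illustration under assumed stage lengths, step counts, and helper names (`Stage`, `rl_step`, `train` are hypothetical and not part of the released recipe); the actual update rule is left as a placeholder.

```python
# Minimal sketch of an iterative context-lengthening schedule.
# All stage sizes, step counts, and helper names are illustrative assumptions,
# not the authors' released training recipe.

from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Stage:
    max_response_tokens: int  # generation/context budget for this stage
    num_steps: int            # RL steps to run before extending the window


# Hypothetical schedule: start with a short budget, then progressively extend it.
SCHEDULE: List[Stage] = [
    Stage(max_response_tokens=8_192, num_steps=1_000),
    Stage(max_response_tokens=16_384, num_steps=500),
    Stage(max_response_tokens=24_576, num_steps=500),
]


def rl_step(policy, batch, max_response_tokens: int) -> None:
    """Placeholder for one outcome-supervised RL update (e.g., a PPO/GRPO-style
    step) in which sampled responses are capped at max_response_tokens."""
    pass  # assumption: the actual update lives in the training framework


def train(policy, dataloader: Iterator) -> None:
    for stage in SCHEDULE:
        for _ in range(stage.num_steps):
            batch = next(dataloader)
            rl_step(policy, batch, stage.max_response_tokens)
        # The next stage reuses the same policy weights but permits longer
        # generations, so the model first learns to reason efficiently within a
        # short budget and only later learns to reason longer.
```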
Primary Area: reinforcement learning
Submission Number: 21966