Keywords: Reinforcement Learning, Pre-training, Reasoning, Training Dynamics
Abstract: The standard training pipeline for Large Language Models (LLMs) proceeds sequentially through pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). Motivated by the success of RL in improving reasoning, we investigate the potential of introducing the RL objective at earlier stages of pretraining. Specifically, we compare applying RL directly to intermediate pretraining checkpoints against SFT-only training and the conventional SFT $\rightarrow$ RL pipeline. Our results show that RL alone can substantially improve reasoning performance even when applied after only 25% of pretraining. Moreover, the direct RL approach can match the performance of the standard SFT $\rightarrow$ RL pipeline. We then analyze how early RL affects the output distribution (as measured by pass@k) and show two opposing cases: RL sharpening versus expanding the model's distribution. Finally, we explore the role of rollout budgets in optimizing performance at early stages of training. Overall, our findings offer novel insights into the effects and potential benefits of introducing RL earlier in the pipeline than current standard practice.
Submission Number: 83