Beyond Two-Stage Training: Integrating SFT and RL for Improved Reasoning in LLMs

12 May 2025 (modified: 26 Nov 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: LLM, Reasoning, RL, SFT
Abstract: Reinforcement learning (RL) has proven effective at incentivizing the reasoning abilities of large language models (LLMs), but it faces significant efficiency challenges due to its extensive trial-and-error nature. A common practice is to employ supervised fine-tuning (SFT) as a warm-up stage; however, this decoupled two-stage approach limits the interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for training reasoning models that employs bilevel optimization to facilitate better cooperation between the two training paradigms. Specifically, the SFT objective is explicitly conditioned on the optimal solution of the RL objective. During training, lower-level updates allow the model to receive SFT supervision concurrently with RL-based exploration, while upper-level updates are optimized to ensure that the joint training yields higher rewards than RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
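To make the bilevel structure concrete, the following is a minimal notational sketch of one way such a coupling can be written; the symbols $\theta$ (model parameters), $\omega$ (upper-level coupling variables), $\mathcal{L}_{\mathrm{SFT}}$, $J_{\mathrm{RL}}$, and the weight $\beta$ are illustrative assumptions, not the paper's exact formulation:

$$
\max_{\omega}\; J_{\mathrm{RL}}\!\big(\theta^{*}(\omega)\big)
\qquad \text{s.t.} \qquad
\theta^{*}(\omega) \;\in\; \arg\min_{\theta}\;
\Big[\, \mathcal{L}_{\mathrm{SFT}}\big(\theta;\,\omega\big) \;-\; \beta\, J_{\mathrm{RL}}(\theta) \,\Big].
$$

Under this reading, the lower-level problem mixes SFT supervision with RL-based exploration, while the upper-level variables are tuned so that the jointly trained model $\theta^{*}(\omega)$ attains a higher reward than RL alone.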
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 27753