Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

ICLR 2026 Conference Submission 15728 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Model, Reasoning Model, Reinforcement Learning, SFT, Meta-Learning
Abstract: Reinforcement learning (RL) has proven effective at incentivizing the reasoning abilities of large language models (LLMs), but it suffers from severe efficiency challenges due to its trial-and-error nature. While common practice employs supervised fine-tuning (SFT) as an RL warmup, the SFT data distribution mismatches the policy's own rollouts. This mismatch produces a dip-then-rise dynamic: early RL forgets SFT-acquired behaviors and must slowly re-explore them, resulting in limited effectiveness and inefficient exploration. We introduce BRIDGE, a method that uses bilevel optimization to foster cooperation between the two training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain—the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations across three LLMs and five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency. Specifically, BRIDGE achieves 44\% faster training with a 13\% performance gain on Qwen2.5-3B, and 14\% faster training with a 10\% improvement on Qwen3-8B.
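A minimal sketch of the bilevel structure described in the abstract, with notation assumed for illustration (it is not given in the abstract): $\theta$ denotes policy parameters, $\omega$ the upper-level SFT variables (e.g., data or loss weights), $J_{\mathrm{RL}}$ the RL objective, $\mathcal{L}_{\mathrm{SFT}}$ the SFT loss, and $J$ a held-out reward used to measure the cooperative gain.

$$\text{(lower level)}\quad \theta^{*}(\omega) \;=\; \arg\max_{\theta}\; J_{\mathrm{RL}}(\theta) \;-\; \mathcal{L}_{\mathrm{SFT}}(\theta;\omega)$$

$$\text{(upper level)}\quad \max_{\omega}\; J\big(\theta^{*}(\omega)\big) \;-\; J\big(\theta^{*}_{\mathrm{RL}}\big), \qquad \theta^{*}_{\mathrm{RL}} \;=\; \arg\max_{\theta} J_{\mathrm{RL}}(\theta)$$

Here the lower level corresponds to RL updates that simultaneously receive SFT supervision, and the upper level maximizes the cooperative gain, i.e., the advantage of the jointly trained policy $\theta^{*}(\omega)$ over the RL-only policy $\theta^{*}_{\mathrm{RL}}$; the exact form used in the paper may differ.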
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15728