Keywords: Large Reasoning Models, Tool Use, Reinforcement Learning, Post-Training
Abstract: Large reasoning models have shown remarkable capabilities, but their internal knowledge is limited, restricting their ability to solve complex tasks. An attractive solution is to integrate external tools—such as Python for math reasoning or search engines for knowledge-intensive queries. Yet, teaching models to use tools effectively remains a significant challenge. Existing approaches often depend on reinforcement learning (RL) with accuracy-based verifiable rewards or on cold-start pipelines that perform supervised fine-tuning (SFT) followed by an RL stage. These methods are notoriously unstable, prone to entropy collapse or convergence to suboptimal behaviors. The problem is compounded in real-world tool-use scenarios where accuracy signals are unavailable or unverifiable. To address this, we propose $\texttt{SR-Loop}$, a general training framework that alternates between SFT and RL phases without relying on accuracy-based rewards in the RL stage. The SFT phase preserves output structure and constrains harmful exploration by imitating expert demonstrations, while the RL phase encourages discovery of new behaviors and improvements beyond the initial policy. By repeatedly cycling between these phases, $\texttt{SR-Loop}$ achieves stable learning and progressively enhances tool-use capabilities using only structural and execution-based rewards. Experiments show that $\texttt{SR-Loop}$ not only prevents training collapse but also delivers competitive performance on complex tool-use reasoning tasks—without requiring explicit accuracy supervision during RL. Moreover, the framework generalizes beyond tool use, proving effective for training general reasoning models even in settings without external tools.
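The abstract describes an alternating SFT/RL training loop driven only by structural and execution-based rewards. The following is a minimal, self-contained sketch of that alternation; it is not the paper's implementation, and every component here (the toy dictionary "policy", the `<tool>...</tool>` output format, the simulated execution check, and the helper names `sft_phase`, `rl_phase`, and `sr_loop`) is a hypothetical stand-in chosen for illustration.

```python
# Hypothetical sketch of an SFT/RL alternation loop in the spirit of SR-Loop.
# All components are toy stand-ins, not the submission's actual training code.
import random


def structural_reward(output: str) -> float:
    """Reward well-formed tool-call syntax (assumed format: '<tool>...</tool>')."""
    return 1.0 if output.startswith("<tool>") and output.endswith("</tool>") else 0.0


def execution_reward(output: str) -> float:
    """Reward outputs whose (simulated) tool execution succeeds; no accuracy label is used."""
    return 1.0 if "error" not in output else 0.0


def sample(policy: dict, prompt: str) -> str:
    """Toy sampler: usually echo the policy's current preferred output, sometimes explore."""
    if policy["preferred"] and random.random() < 0.8:
        return policy["preferred"]
    return random.choice(
        ["<tool>search(query)</tool>", "plain text answer", "<tool>error</tool>"]
    )


def sft_phase(policy: dict, demonstrations: list[str]) -> None:
    """Imitate expert demonstrations to anchor output structure and curb harmful exploration."""
    for demo in demonstrations:
        policy["preferred"] = demo  # stand-in for a supervised gradient step


def rl_phase(policy: dict, prompts: list[str], num_rollouts: int = 4) -> None:
    """Explore beyond the anchored policy using only structural + execution rewards."""
    for prompt in prompts:
        rollouts = [sample(policy, prompt) for _ in range(num_rollouts)]
        scored = [(structural_reward(o) + execution_reward(o), o) for o in rollouts]
        best_reward, best_output = max(scored)
        if best_reward > 0:
            policy["preferred"] = best_output  # stand-in for a policy-gradient update


def sr_loop(demonstrations: list[str], prompts: list[str], rounds: int = 3) -> dict:
    """Alternate SFT and RL phases for several rounds, as the abstract describes."""
    policy = {"preferred": ""}
    for _ in range(rounds):
        sft_phase(policy, demonstrations)
        rl_phase(policy, prompts)
    return policy


if __name__ == "__main__":
    demos = ["<tool>python(print(2 + 2))</tool>"]
    print(sr_loop(demos, prompts=["What is 2 + 2?"]))
```

In this sketch, the SFT step repeatedly re-anchors the policy on demonstrations before each exploration round, which is the mechanism the abstract credits with preventing entropy collapse; a real implementation would replace the dictionary updates with actual supervised and policy-gradient optimization over model parameters.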
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20740