Keywords: Large Language Models, Tool-Integrated Reasoning, Reinforcement Learning, Code Generation, LLM Reasoning
Abstract: Large Language Models (LLMs) can enhance their reasoning by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn settings with Reinforcement Learning (RL) often leads to training instability and degraded performance. We attribute this instability to harmful negative samples caused by distributional drift and compounding errors that arise when external tool outputs are fed back into multi-turn rollouts. To address this issue, we introduce SimpleTIR, a simple method that stabilizes multi-turn TIR training by filtering out trajectories with "void turns", i.e., turns that yield neither a code block nor a final answer. Specifically, we remove these trajectories from the policy update to block harmful gradients, while retaining them in advantage estimation to keep the estimate unbiased. Extensive experiments show that SimpleTIR effectively mitigates gradient norm explosion and stabilizes multi-turn RL training from base models. It achieves state-of-the-art performance on challenging math reasoning benchmarks, including an AIME24 score of 50.5 starting from the Qwen2.5-7B base model. SimpleTIR also promotes more diverse reasoning behaviors such as self-correction and cross-validation, outperforming prior methods trained from stronger instruction-tuned models.
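The filtering rule described in the abstract can be pictured with a short sketch. The snippet below is an illustrative Python/NumPy sketch, not the authors' implementation: the group-relative advantage normalization, the void-turn markers (code fences and \boxed{} answers), and the function names are assumptions made for exposition.

```python
import numpy as np

CODE_FENCE = "`" * 3          # assumed marker for a code block within a turn
ANSWER_MARK = "\\boxed{"      # assumed marker for a final answer within a turn

def has_void_turn(turns):
    """Return True if any turn yields neither a code block nor a final answer."""
    return any(CODE_FENCE not in t and ANSWER_MARK not in t for t in turns)

def simpletir_group_update(rewards, trajectories):
    """Compute group-relative advantages over ALL sampled trajectories (so the
    baseline stays unbiased), then build a mask that drops void-turn
    trajectories from the policy-gradient loss."""
    r = np.asarray(rewards, dtype=np.float64)
    advantages = (r - r.mean()) / (r.std() + 1e-8)
    keep = np.array([not has_void_turn(t) for t in trajectories])
    return advantages, keep  # only trajectories with keep=True contribute gradients
```

In this sketch, the policy-gradient loss would sum only over trajectories where the mask is true, blocking gradients from void-turn trajectories while their rewards still enter the group baseline used for advantage estimation.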
Submission Number: 79