Abstract: We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more
capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled
theory explaining why this paradigm is effective has been missing. This work provides the first formal
proof that TIR fundamentally expands an LLM’s capabilities. We demonstrate that tools enable a strict
expansion of the model’s empirical and feasible support, breaking the capability ceiling of pure-text models
by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide
model behavior without compromising training stability or performance, we also introduce Advantage
Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function
to steer policy behavior. We conduct comprehensive experiments on challenging mathematical
benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model
decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not
confined to computationally intensive problems but extends to those requiring significant abstract insight.
We further identify the emergent cognitive patterns that illustrate how models learn to think with tools.
Finally, we report improved tool-usage behavior under ASPO, with earlier code invocation and substantially
more interaction turns. Overall, our work provides the first principled explanation for TIR's success, shifting
the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
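To make the advantage-shaping idea concrete, the sketch below is a minimal illustration, not the paper's exact ASPO update: it adds a hypothetical "earliness" bonus for the first code invocation directly to group-normalized advantages, leaving the outcome reward itself untouched. All function names, the bonus form, and the `bonus_scale` parameter are illustrative assumptions.

```python
import numpy as np

def shaped_advantages(rewards: np.ndarray,
                      first_tool_call_step: np.ndarray,
                      max_steps: int,
                      bonus_scale: float = 0.1) -> np.ndarray:
    """Group-relative advantages with a hypothetical shaping term.

    rewards: scalar outcome reward per rollout in the group.
    first_tool_call_step: step index of the first code invocation per
        rollout (equal to max_steps if the tool was never called).
    """
    # Group-normalized outcome advantage (a GRPO-style baseline).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Hypothetical shaping: add a small bonus for early tool invocation
    # to the advantage directly, rather than modifying the reward.
    earliness = 1.0 - first_tool_call_step / max_steps
    return adv + bonus_scale * earliness

# Example: 4 rollouts in a group; earlier code calls get a slightly
# larger advantage, nudging the policy toward early tool use.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
first_call = np.array([2, 10, 5, 10])
print(shaped_advantages(rewards, first_call, max_steps=10))
```

Shaping the advantage rather than the reward keeps the outcome signal intact, which is one way to bias behavior without destabilizing training.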