Abstract: Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the use of long Chain-of-Thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their exclusive reliance on internal reasoning processes. In this paper, we introduce START (\textbf{S}elf-\textbf{Ta}ught \textbf{R}easoner with \textbf{T}ools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs.
The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints
(e.g., “Wait, maybe using Python here is a good idea.”)
during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the tool-invoking reasoning trajectories generated by an LRM via Hint-infer, followed by fine-tuning the LRM.
Through this framework, we fine-tune the QwQ-32B model to obtain START.
On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6\%, 95.0\%, 66.7\%, 47.8\%, and 47.3\%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
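To make the Hint-infer idea above concrete, the following minimal sketch illustrates one possible way to insert a hint into a partially decoded reasoning trace and execute any Python code the model then writes. It is not the paper's implementation: the `generate` wrapper, the hint positions, and the output-feedback format are assumptions made purely for illustration.

```python
import re
import subprocess

# Hand-crafted hint from the paper; where it is inserted is an implementation choice.
HINT = "Wait, maybe using Python here is a good idea."

def extract_code_block(text: str) -> str | None:
    """Return the first ```python ...``` block in the model output, if any."""
    m = re.search(r"```python\n(.*?)```", text, re.DOTALL)
    return m.group(1) if m else None

def run_python(code: str, timeout: int = 10) -> str:
    """Execute model-written code in a subprocess and capture its output (sandboxing omitted)."""
    proc = subprocess.run(["python", "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def hint_infer(generate, problem: str, hint_positions: list[int]) -> str:
    """Sketch of Hint-infer: append a designed hint at chosen points of the
    reasoning trace to nudge the LRM toward writing and executing Python.

    `generate(problem, prefix)` is a hypothetical wrapper around an LRM's
    decoding API that continues generation from `prefix`.
    """
    trajectory = generate(problem, prefix="")            # vanilla long-CoT rollout
    for pos in hint_positions:                           # e.g. before the final answer
        prefix = trajectory[:pos] + "\n" + HINT + "\n"
        continuation = generate(problem, prefix=prefix)
        code = extract_code_block(continuation)
        if code:                                         # the model chose to invoke the tool
            feedback = run_python(code)
            continuation += f"\n```output\n{feedback}\n```\n"
            continuation += generate(problem, prefix=prefix + continuation)
        trajectory = prefix + continuation
    return trajectory
```

In a Hint-RFT-style loop, trajectories produced this way would then be scored and filtered before being used to fine-tune the LRM; that step is omitted here.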
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Long Chain-of-Thought, Tool-integrated Reasoning, Large Reasoning Model
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 5951