Adaptive Tool Generation with Models as Tools and Reinforcement Learning

Chenpeng Wang; zhanglei; chengxiaojie; Chunye Wang; LinFeng Yang; Lei Li

17 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | CC BY 4.0

Keywords: Simulation-based Reasoning; Reinforcement Learning; Agentic AI; Tool-Augmented Language Models

TL;DR: We introduce the Model-as-Tool Reasoning framework, which decouples tool reasoning from execution using dynamic, verifiable contracts. It leverages a three-agent architecture and a novel reward function to ensure consistent and correct task solving.

Abstract: Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR (Model-as-Tool Reasoning), a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach uses a multi-agent architecture in which a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think–act–observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches the "trace grammar" from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains Exact Match (EM) scores competitive with those of live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.
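
The Stage-2 idea in the abstract (a composite trace reward fed into group-relative optimization) can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything specific here is an assumption: the linear blend, the weight `alpha`, the externally supplied `consistency` score in [0, 1], and the helper names `trace_reward` and `group_relative_advantages` are hypothetical. The answer normalization follows the common EM convention for QA benchmarks, and the advantage step is the usual GRPO normalization of rewards within a sampled group.

```python
import re
import string
from statistics import mean, pstdev


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def trace_reward(prediction: str, gold: str, consistency: float,
                 alpha: float = 0.7) -> float:
    """Blend answer correctness with a trace-consistency score.

    ASSUMPTION: the linear form and alpha=0.7 are illustrative only;
    the paper's actual composite reward is not specified in the abstract.
    """
    return alpha * exact_match(prediction, gold) + (1.0 - alpha) * consistency


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of sampled traces (GRPO-style).

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical traces sampled for one question: (answer, consistency).
    gold = "Eiffel Tower"
    group = [("The Eiffel Tower.", 0.9), ("Louvre", 0.8),
             ("eiffel tower", 0.4), ("Notre-Dame", 0.2)]
    rewards = [trace_reward(ans, gold, cons) for ans, cons in group]
    for (ans, _), adv in zip(group, group_relative_advantages(rewards)):
        print(f"{ans!r}: advantage {adv:+.3f}")
```

Under these assumptions, a trace with a correct answer but low consistency still receives a smaller advantage than a correct and consistent one, which is the trade-off the composite reward is meant to capture.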

Supplementary Material: zip

Primary Area: reinforcement learning

Submission Number: 9255
