TOOLWEAVE: FINE-GRAINED AND CONTROLLABLE SYNTHETIC DATA GENERATION FOR MULTI-TURN TOOL CALLING WITH NON-FRONTIER LLMS
Keywords: Tool Calling, Synthetic Data Generation, Multi-Turn Dialogues, Large Language Models (LLMs), License-Friendly Data
TL;DR: ToolWeave is a fully synthetic data generation framework that uses open-source models to create both the APIs and the multi-turn dialogues needed to fine-tune smaller, license-friendly LLMs for complex tool-calling.
Abstract: Multi-turn tool-calling is a crucial capability for LLM-based agents and is typically improved via supervised fine-tuning on synthetic data. Existing multi-turn tool-calling synthetic data pipelines often rely on proprietary frontier LLMs (e.g., GPT-4) or commercial APIs (e.g., RapidAPI), introducing restrictive licensing. In contrast, data generated directly from open LLMs suffers from low fidelity, poor diversity, and weak adherence to multi-constraint instructions, yielding lower-quality datasets than those produced by frontier models. To address these limitations, we propose ToolWeave, a modular and controllable pipeline that synthesizes high-quality multi-turn tool-calling datasets using non-frontier, license-friendly LLMs. ToolWeave supports both API and dialogue synthesis. Our framework's novelty is threefold: (1) it is fully synthetic; given only a domain name, it builds a domain context from Wikipedia and Wikidata to synthesize a Tool Graph of APIs. (2) In contrast to other pipelines' single, failure-prone planning step, ToolWeave's scaffolding process first generates a high-level goal from the Tool Graph, then decomposes it into a turn-level dialogue plan. This two-stage approach enables non-frontier LLMs to generate high-fidelity, grounded dialogues. (3) A final post-processing stage injects lexical diversity and robustness patterns (e.g., error recovery) to simulate real-world scenarios. To validate our framework, we generated a dataset of ~3.2k dialogues using the open-source gpt-oss-120b. Compared to the baselines ToolFlow and ToolDial, ToolWeave shows clear gains: on the BFCL benchmark, our data improves Llama-3.1-70B to 33.25% (vs. ToolFlow's 21.00% and ToolDial's 3.75%) and Phi-4 to 24.50% (vs. ToolFlow's 8.88% and ToolDial's 2.0%). Our data also shows strong generalization, with peak gains of 37.6% on the API Bank benchmark.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23553