Keywords: human-in-the-loop, applications, conversational modeling
Abstract: The transition of Large Language Models (LLMs) from conversational interfaces to autonomous agents hinges on their ability to master complex tool-use trajectories. However, training reliable agents is bottlenecked by the scarcity of high-fidelity interaction data. While synthetic data generation offers a scalable solution, ensuring the quality of these dynamic, multi-turn trajectories remains a formidable challenge. Existing approaches often rely on ad-hoc heuristics or filtering mechanisms inextricably coupled with the generation process, limiting their scalability and adaptability to new domains. In this work, we introduce Tool-Verifier-7B, a plug-and-play framework designed to verify the quality of tool-use data through a decoupled architecture. Tool-Verifier is trained with data curated via a dual-consistency mechanism that validates both the outcome and the process of interactions, enabling it to identify subtle reasoning errors and policy violations. We further introduce Tool-Verify, a curated dataset of 3,295 multi-turn tool-use samples, and Tool-V-Bench, a high-quality benchmark of 165 human-validated trajectories. Extensive experiments demonstrate that Tool-Verifier-7B not only outperforms GPT-4o on verification benchmarks but also exhibits strong generalization capabilities. Our verifier effectively transfers to out-of-domain tasks (e.g., Airline, BFCL).
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: human-in-the-loop, applications, conversational modeling
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 10904