Keywords: human-in-the-loop, applications, conversational modeling
Abstract: The transition of Large Language Models (LLMs) from conversational interfaces to autonomous agents hinges on their ability to master complex tool-use trajectories. However, training reliable agents is bottlenecked by the scarcity of high-fidelity interaction data. While synthetic data generation offers a scalable solution, ensuring the quality of these dynamic, multi-turn trajectories remains a formidable challenge. Existing approaches often rely on ad-hoc heuristics or filtering mechanisms inextricably coupled with the generation process, limiting their scalability and adaptability to new domains. In this work, we introduce Tool-Verifier-7B, a plug-and-play framework designed to verify the quality of tool-use data through a decoupled architecture. Tool-Verifier is trained with data curated via a dual-consistency mechanism that validates both the outcome and the process of interactions, enabling it to identify subtle reasoning errors and policy violations. We further introduce Tool-Verify, a curated dataset of 3,295 multi-turn tool-use samples, and Tool-V-Bench, a high-quality benchmark of 165 human-validated trajectories. Extensive experiments demonstrate that Tool-Verifier-7B not only outperforms GPT-4o on verification benchmarks but also exhibits strong generalization capabilities. Our verifier effectively transfers to out-of-domain tasks (e.g., Airline, BFCL).
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: human-in-the-loop, applications, conversational modeling
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 10904