From Failure to Mastery: Generating Hard Samples for Tool-use Agents

ACL ARR 2026 January Submission2019 Authors

01 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Tool-use; Function Call; LLM/AI agents; applications
Abstract: The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce **HardGen**, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. $Firstly$, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. $Secondly$, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. $Finally$, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several strong competitors ($e.g.$, GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: applications; chain-of-thought; LLM/AI agents
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2019