Failing Tools: Benchmarking LLM Agent Recovery Under Runtime Tool Failures

ACL ARR 2026 May Submission 17380 Authors

26 May 2026 (modified: 16 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: LLM, Agents, Tool Use, Function Calling, Benchmarking, Robustness, Reasoning, Hallucination, Evaluation
Abstract: Tool-augmented language model agents are increasingly deployed against external services that fail in the messy ways typical of production, yet existing function-calling benchmarks largely evaluate them on a "happy path" where tools are available, documentation is accurate, and observations can be trusted. We introduce Failing Tools, a benchmark that systematically injects runtime failures into multi-turn tool-calling scenarios and measures whether agents can detect failures, distinguish transient from permanent faults, retry or fall back appropriately, verify state by calling confirmation functions whenever they are available, and faithfully communicate residual uncertainty. Built on stateful, multi-domain APIs, the benchmark covers availability denial, data staleness, silent no-ops, corrupted state, schema mismatch, disambiguation failures, and compound cascades. It pairs each scenario with trajectory-level recovery criteria that go beyond final-answer accuracy to score detection, recovery strategy, safety, and calibration. Across frontier tool-calling models, strong performance under standard conditions does not transfer to unreliable tools: under our base recovery evaluator, no model exceeds 11.47% accuracy on 218 scenarios, and the dominant failure is missing verification or recovery steps rather than incorrect tool selection. Failing Tools provides a practical framework for studying dependable agent behavior in realistic, partially observable tool environments and exposes a substantial gap between benchmark competence and deployment robustness.
Paper Type: Long
Research Area: LLM agents
Research Area Keywords: LLM agents, agent evaluation, tool learning, tool use, function calling, benchmark datasets, robustness, failure recovery, multi-step reasoning, runtime adaptation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 17380
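
As an illustration of the recovery behavior the abstract describes (detect a failure, separate transient from permanent faults, retry, and verify state through a confirmation call), the following is a minimal sketch. It is not the benchmark's harness or API: `FlakyStore`, `TransientToolError`, `PermanentToolError`, and `write_with_recovery` are hypothetical names invented for this example, and the injected faults cover only a timeout-style error, a disabled service, and a silent no-op.

```python
from dataclasses import dataclass, field


class TransientToolError(Exception):
    """Hypothetical fault that may clear on retry (e.g. a timeout)."""


class PermanentToolError(Exception):
    """Hypothetical fault that retrying cannot fix (e.g. service disabled)."""


@dataclass
class FlakyStore:
    """Toy stateful tool with injected failures (assumed, not the paper's API)."""
    transient_failures: int = 0  # next N writes raise a transient error
    silent_noops: int = 0        # after that, next N writes report "ok" but store nothing
    permanent: bool = False      # every write fails permanently
    state: dict = field(default_factory=dict)

    def write(self, key, value):
        if self.permanent:
            raise PermanentToolError("service disabled")
        if self.transient_failures > 0:
            self.transient_failures -= 1
            raise TransientToolError("timeout")
        if self.silent_noops > 0:
            self.silent_noops -= 1
            return "ok"  # silent no-op: success is reported, state is unchanged
        self.state[key] = value
        return "ok"

    def read(self, key):
        """Confirmation function used to verify that a write took effect."""
        return self.state.get(key)


def write_with_recovery(store, key, value, max_attempts=3):
    """Retry transient faults, stop on permanent ones, verify every reported success."""
    trace = []
    for _ in range(max_attempts):
        try:
            store.write(key, value)
        except TransientToolError:
            trace.append("transient_error")
            continue  # worth retrying
        except PermanentToolError:
            trace.append("permanent_error")
            return {"status": "failed", "trace": trace}  # do not retry
        # Do not trust the tool's "ok": confirm the state actually changed.
        if store.read(key) == value:
            trace.append("verified")
            return {"status": "confirmed", "trace": trace}
        trace.append("unverified_write")
    # Attempts exhausted: report residual uncertainty instead of claiming success.
    return {"status": "unconfirmed", "trace": trace}


result = write_with_recovery(FlakyStore(transient_failures=1, silent_noops=1), "seat", "12A")
print(result)
```

The returned `trace` is a stand-in for the kind of trajectory-level evidence a recovery evaluator could inspect; the status labels are likewise assumptions rather than the paper's scoring categories.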