ToolFailBench: Diagnosing Tool-Use Failures in LLM Agents

Published: 23 May 2026, Last Modified: 23 May 2026. Venue: ICML 2026 AIWILD. Readers: Everyone. License: CC BY 4.0
Keywords: tool use, function calling, LLM agents, diagnostic benchmark, failure mode taxonomy, tool-use faithfulness, parametric traps, agent evaluation, tool calling reliability
TL;DR: A diagnostic benchmark that separates LLM tool-use failures into Tool-Skip, Result-Ignore, Output-Fabrication, and Unnecessary-Tool-Use.
Abstract: Tool calling is central to modern language model agents, but aggregate benchmark scores often hide where tool use fails. A model that never calls a needed tool and a model that calls the tool but ignores the result can look similar under final task accuracy. We introduce *ToolFailBench*, a diagnostic benchmark for measuring tool-use failures across $1{,}000$ tasks in finance, medicine, law, cybersecurity, and real estate. Tool-required tasks return values that the model could not plausibly guess, so a correct answer requires trusting the tool output. Control tasks attach the same tools but should be answered directly, without a tool call. We label each trace with a failure-mode taxonomy covering Tool-Skip, Result-Ignore, Output-Fabrication, and Unnecessary-Tool-Use, using a deterministic rule classifier and two independent LLM judges aggregated by majority vote. Across 19 headline models, the best reaches a Clean Tool-Use Rate of 86.33\%, showing that faithful tool use is not saturated. More importantly, models with similar aggregate scores fail in different ways: most stay disciplined on no-tool controls, while Llama-3.1 models show an Always-Call pattern, and at the same parameter scale Llama-3.1-70B and Qwen2.5-72B differ by 89 percentage points on control-task accuracy. Tool-use evaluation should measure not only whether agents call tools, but also whether they use tool outputs correctly and avoid tools when none is needed.
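As an illustration of the labeling step described in the abstract, the following is a minimal sketch of majority-vote aggregation over three labelers (one rule classifier and two LLM judges). The label strings, the function name `majority_label`, and the choice to return `None` when no label wins a strict majority are assumptions made for this sketch; the abstract does not say how ties are resolved.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical label names: the four failure modes from the taxonomy,
# plus an assumed "clean" label for traces with no failure.
LABELS = {
    "clean",
    "tool_skip",
    "result_ignore",
    "output_fabrication",
    "unnecessary_tool_use",
}


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of voters, else None.

    Returning None on a tie is an assumption of this sketch, not a rule
    taken from the paper.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    unknown = set(votes) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Example: rule classifier and one judge agree, the other judge dissents.
example = majority_label(["tool_skip", "tool_skip", "clean"])  # "tool_skip"
```

With three voters and five possible labels, a three-way disagreement is possible, which is why the sketch returns `None` explicitly instead of picking an arbitrary winner.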
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 266