Keywords: AI Agents, LLM Agents, Fuzzing, LLM Fuzzing, Tool Documentation
TL;DR: We introduce ToolFuzz, the first method to automatically test tools used by LLM agents, uncovering specification errors that lead to crashes and incorrect responses.
Abstract: Large Language Model Agents (LLM Agents) leverage the advanced reasoning capabilities of LLMs in real-world applications. To interact with the environment, these agents require tools such as web searches or database APIs. As the agent provides the LLM with tool documentation alongside a user query, the completeness and correctness of this documentation is critical. However, tool documentation is often over-, under-, or ill-specified, impeding the agent's accuracy. Standard software testing approaches struggle to identify these errors because the documentation is written in natural language. Thus, despite its importance, there currently exists no automated method to test agent tool usage. To address this, we present ToolFuzz, the first automated agent tool testing method. ToolFuzz combines LLMs with fuzzing to generate diverse and natural user queries that cause tool runtime errors or semantically incorrect agent responses. We evaluate ToolFuzz on 139 tools from the community-based, open-source LangChain and the production-ready, closed-source Composio, and find that all LangChain tools and the majority of Composio tools are erroneous. To validate the relevance of the errors identified by ToolFuzz, we design an automated pipeline to improve tool documentation. Specifically, we introduce two novel benchmarks with over 300 tasks, known ground truth, and real environments based on GitHub and terminal file management. Our automated tool-fixing pipeline increases accuracy from 22.9% to 35.4% on GitHub tasks and from 29% to 39% on file management tasks. ToolFuzz consistently outperforms the baselines and identifies 50% more unique errors while reducing the false discovery rate by 4.5x, making it a key component for building reliable AI agents.
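The abstract describes ToolFuzz as an LLM-driven fuzzing loop over agent tools: generate natural user queries from the tool documentation, execute the tool, and flag runtime errors or outputs that contradict the documentation. Below is a minimal, self-contained sketch of such a loop; it is not ToolFuzz's actual implementation, and `Tool`, `call_llm`, and all prompt wording are hypothetical placeholders introduced only for illustration.

```python
# Minimal sketch of an LLM-driven fuzzing loop for agent tools.
# `Tool` and `call_llm` are hypothetical placeholders, not ToolFuzz's real API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    documentation: str          # natural-language spec shown to the agent
    run: Callable[[str], str]   # executes the tool on an agent-produced input


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to any LLM completion API."""
    raise NotImplementedError


def fuzz_tool(tool: Tool, num_queries: int = 50) -> list[dict]:
    """Generate natural user queries from the tool documentation and record
    runtime errors or answers that appear to contradict the documentation."""
    failures = []
    for _ in range(num_queries):
        # 1. Ask the LLM for a diverse, natural user query that the
        #    documentation claims the tool can handle.
        query = call_llm(
            f"Given this tool documentation:\n{tool.documentation}\n"
            "Write one realistic user request this tool should satisfy."
        )
        # 2. Let the LLM translate the query into a concrete tool input.
        tool_input = call_llm(
            f"Documentation:\n{tool.documentation}\nUser request: {query}\n"
            "Return only the input string to pass to the tool."
        )
        # 3. Execute the tool; any exception is a runtime error finding.
        try:
            output = tool.run(tool_input)
        except Exception as err:
            failures.append({"query": query, "error": repr(err)})
            continue
        # 4. Use the LLM as an oracle for semantic correctness.
        verdict = call_llm(
            f"Documentation:\n{tool.documentation}\nRequest: {query}\n"
            f"Output: {output}\n"
            "Does the output satisfy the request as documented? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("NO"):
            failures.append({"query": query, "output": output})
    return failures
```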
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24704