GraphToolBench: Benchmarking LLMs for Sequential Graph Comprehension and Conflict Identification in Tool Learning

ACL ARR 2026 January Submission 1830 Authors

31 Dec 2025 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Tool Learning
Abstract: Large language models (LLMs) have advanced rapidly but still exhibit outdated knowledge, unreliable reasoning, and limited real-world interaction. Tool learning addresses these gaps by enabling models to call external tools across four stages: task planning, tool selection, tool calling, and response generation. Existing benchmarks focus mainly on final-answer accuracy and do not assess intermediate capabilities such as sequential graph comprehension or the detection of conflicts between tool outputs and model knowledge; they are also vulnerable to network failures and usage costs. We introduce GraphToolBench, a comprehensive benchmark that covers all four tool-learning stages and sidesteps these practical constraints by providing an offline Model Context Protocol (MCP) function library of executable Python functions derived from online tools. We develop a sampling procedure, Conflict Potential Random Sampling (CPRS), that produces tool sets with controllable levels of disagreement between tool results and model knowledge. Building on these tool sets, we combine an advanced LLM with human expertise to generate data aligned with the characteristics of the four tool-learning stages. We further present GraphToolEval, a multi-dimensional evaluation suite that measures sequential graph understanding and conflict identification. Empirical results yield deeper and more granular insights than prior benchmarks provide. Code and data are available at https://anonymous.4open.science/r/GraphToolBench.
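
The abstract describes CPRS only at a high level, so the following is a minimal, hypothetical sketch of what such conflict-controlled sampling could look like. The function name cprs_sample, the per-tool conflict_score field, and the target_conflict and temperature parameters are illustrative assumptions, not the authors' implementation:

import math
import random

def cprs_sample(tools, k, target_conflict, temperature=0.5, seed=None):
    """Hypothetical CPRS sketch: sample k tools without replacement,
    biased toward a desired average conflict potential.

    tools: list of dicts like {"name": str, "conflict_score": float in [0, 1]},
           where conflict_score is assumed to estimate how likely the tool's
           outputs are to contradict the model's parametric knowledge.
    target_conflict: desired conflict level of the sampled tool set.
    temperature: how sharply the sampler prefers scores near the target.
    """
    rng = random.Random(seed)
    pool = list(tools)
    chosen = []
    for _ in range(min(k, len(pool))):
        # Weight each remaining tool by its closeness to the target conflict level.
        weights = [math.exp(-abs(t["conflict_score"] - target_conflict) / temperature)
                   for t in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# Example: draw a 5-tool set expected to disagree strongly with model knowledge.
toolbox = [{"name": f"tool_{i}", "conflict_score": i / 9} for i in range(10)]
print([t["name"] for t in cprs_sample(toolbox, k=5, target_conflict=0.9, seed=0)])

Sweeping target_conflict from low to high would then yield tool sets spanning the controllable disagreement levels the abstract mentions.
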
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, NLP datasets, evaluation methodologies
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 1830