GraphToolBench: Benchmarking LLMs for Sequential Graph Comprehension and Conflict Identification in Tool Learning

ACL ARR 2026 January Submission 1830 Authors

31 Dec 2025 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Tool Learning
Abstract: Large language models (LLMs) have advanced rapidly but still exhibit outdated knowledge, unreliable reasoning, and limited real-world interaction. Tool learning addresses these gaps by enabling models to call external tools across four stages: task planning, tool selection, tool calling, and response generation. Existing benchmarks focus mainly on final-answer accuracy and do not assess intermediate capabilities such as sequential graph comprehension or the detection of conflicts between tool outputs and model knowledge; they are also vulnerable to network failures and usage costs. We introduce GraphToolBench, a comprehensive benchmark that covers all four tool-learning stages and sidesteps these practical constraints by providing an offline Model Context Protocol (MCP) function library of executable Python functions derived from online tools. We develop a sampling procedure, Conflict Potential Random Sampling (CPRS), that produces tool sets with controllable levels of disagreement between tool results and model knowledge. Building on these tool sets, we combine an advanced LLM with human expertise to generate data aligned with the characteristics of the four tool-learning stages. We further present GraphToolEval, a multi-dimensional evaluation suite that measures sequential graph understanding and conflict identification. Empirical results yield deeper and more granular insights than prior benchmarks provide. Code and data are available at https://anonymous.4open.science/r/GraphToolBench.
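
The abstract describes CPRS only at a high level, so the following is a minimal, hypothetical sketch of what such conflict-controlled sampling could look like. The function name cprs_sample, the per-tool conflict_score field, and the target_conflict and temperature parameters are illustrative assumptions, not the authors' implementation:

import math
import random

def cprs_sample(tools, k, target_conflict, temperature=0.5, seed=None):
    """Hypothetical CPRS sketch: sample k tools without replacement,
    biased toward a desired average conflict potential.

    tools: list of dicts like {"name": str, "conflict_score": float in [0, 1]},
           where conflict_score is assumed to estimate how likely the tool's
           outputs are to contradict the model's parametric knowledge.
    target_conflict: desired conflict level of the sampled tool set.
    temperature: how sharply the sampler prefers scores near the target.
    """
    rng = random.Random(seed)
    pool = list(tools)
    chosen = []
    for _ in range(min(k, len(pool))):
        # Weight each remaining tool by its closeness to the target conflict level.
        weights = [math.exp(-abs(t["conflict_score"] - target_conflict) / temperature)
                   for t in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

# Example: draw a 5-tool set expected to disagree strongly with model knowledge.
toolbox = [{"name": f"tool_{i}", "conflict_score": i / 9} for i in range(10)]
print([t["name"] for t in cprs_sample(toolbox, k=5, target_conflict=0.9, seed=0)])

Sweeping target_conflict from low to high would then yield tool sets spanning the controllable disagreement levels the abstract mentions.
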
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, NLP datasets, evaluation methodologies
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 1830