Keywords: Tool-using Agent; Real-World Tasks; Model Context Protocol
Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and long-horizon planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic multi-step tasks with rich input–output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions that omit explicit tool names, plan multi-hop execution trajectories toward complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows. These capabilities are not adequately evaluated by existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage as well as trajectory-level planning and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench.
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 147
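To make the abstract's setup concrete, below is a minimal sketch of how an agent harness might connect to a live MCP server, enumerate its tools, and invoke one of them, assuming the official MCP Python SDK (the `mcp` package). The server launch command, tool name, and arguments are hypothetical placeholders for illustration, not part of MCP-Bench itself.

```python
# Minimal sketch of an MCP client loop, assuming the official
# MCP Python SDK ("mcp" package). The server command and the
# tool name/arguments below are hypothetical placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch a local MCP server over stdio (hypothetical command).
    params = StdioServerParameters(command="uvx", args=["example-mcp-server"])

    async with stdio_client(params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Enumerate the server's tools; an agent harness would pass
            # these schemas to the LLM so it can select tools from fuzzy
            # instructions that never name a tool explicitly.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

            # Invoke one tool with schema-conforming arguments
            # (name and arguments are placeholders).
            result = await session.call_tool(
                "search_papers", arguments={"query": "tool-using agents"}
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```

In a multi-server setting like the one the abstract describes, the harness would hold one such session per server and merge the tool schemas into a single catalog, which is what makes cross-domain orchestration possible.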