MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

ICLR 2026 Conference Submission 20592 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Large Language Models, Agent, Tool Use, Benchmark, Model Context Protocol
TL;DR: MCPMark is a comprehensive benchmark for stress-testing agents and models in realistic MCP-based scenarios, with 127 tasks across Notion, GitHub, Filesystem, PostgreSQL, and Playwright.
Abstract: The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this, we propose \texttt{MCPMark}, a benchmark designed to evaluate realistic and comprehensive MCP use, comprising $127$ high-quality tasks collaboratively created by human experts and AI agents. Specifically, each task starts from a curated initial state and includes a programmatic script for automatic verification. Moreover, these tasks require richer and more varied interactions with the environment, involving diverse create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, \texttt{gpt-5-medium}, reaches only $52.56$\% pass@1 and $33.86$\% pass$^4$, while other widely regarded strong models, including \texttt{claude-sonnet-4} and \texttt{o3}, fall below $30$\% pass@1 and $15$\% pass$^4$. On average, LLMs require $16.18$ execution turns and $17.38$ tool calls per task, substantially more than in previous MCP benchmarks, demonstrating the stress-testing nature of \texttt{MCPMark}.
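For intuition, the "minimal agent framework that operates in a tool-calling loop" mentioned in the abstract can be sketched as below. This is a hedged illustration, not the authors' released code: `llm_complete` and `mcp_server` are hypothetical stand-ins for a chat-completion API and an MCP server client.

```python
# Minimal sketch of a tool-calling agent loop (illustrative only).
# Assumptions: `llm_complete(messages, tools)` returns a dict with optional
# "tool_calls"; `mcp_server` exposes `list_tools()` and `call_tool(name, args)`.

def run_task(task_prompt, mcp_server, llm_complete, max_turns=50):
    """Drive the model in a loop: call the LLM, execute any requested tools,
    feed results back as messages, and stop when a final answer is returned."""
    messages = [{"role": "user", "content": task_prompt}]
    tools = mcp_server.list_tools()  # tool schemas exposed via MCP

    for _ in range(max_turns):
        reply = llm_complete(messages, tools)  # one LLM step, tools visible
        messages.append(reply)

        if not reply.get("tool_calls"):  # no tool requested -> task finished
            return reply.get("content")

        for call in reply["tool_calls"]:  # execute each CRUD-style tool call
            result = mcp_server.call_tool(call["name"], call["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })

    return None  # turn budget exhausted without a final answer
```

In the benchmark's setup, a per-task verification script would then inspect the final environment state (e.g., the Notion page, GitHub repo, or database) and emit pass/fail, which is how pass@1 and pass$^4$ are computed.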
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20592