MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Published: 26 Jan 2026, Last Modified: 28 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large Language Models, Agent, Tool Use, Benchmark, Model Context Protocol
TL;DR: MCPMark is a comprehensive benchmark for stress-testing agents and models in realistic MCP-based scenarios, with 127 tasks across Notion, GitHub, Filesystem, PostgreSQL, and Playwright.
Abstract: The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose \texttt{MCPMark}, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents, each with a curated initial state and a programmatic verification script. These tasks demand diverse CRUD operations and rich environmental interactions. We evaluate cutting-edge LLMs using a minimal agent framework. The best-performing model, \texttt{gpt-5-medium}, reaches only $52.56$\% pass@1 and $33.86$\% pass\textasciicircum{}4, while other strong models, including \texttt{claude-sonnet-4} and \texttt{o3}, fall below $30$\% pass@1 and $15$\% pass\textasciicircum{}4. On average, LLMs require $16.2$ turns and $17.4$ tool calls per task, highlighting the stress-testing nature of \texttt{MCPMark}.
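The pass@1 and pass^4 figures above can be illustrated with a minimal sketch. This assumes the common definitions: pass@1 averages per-task success rates over independent runs, and pass^k counts a task as solved only if all k runs succeed (the function names and sample data here are illustrative, not from the paper):

```python
from statistics import mean

def pass_at_1(results):
    """Average per-task success rate.
    results: list of per-task lists of booleans, one bool per run."""
    return mean(sum(r) / len(r) for r in results)

def pass_pow_k(results, k):
    """Fraction of tasks whose first k runs all succeed (pass^k)."""
    return mean(1.0 if all(r[:k]) else 0.0 for r in results)

# Hypothetical outcomes for 3 tasks, 4 runs each.
runs = [
    [True, True, True, True],      # solved in every run
    [True, False, True, False],    # flaky task
    [False, False, False, False],  # never solved
]
print(pass_at_1(runs))      # 0.5
print(pass_pow_k(runs, 4))  # 0.333...
```

Because pass^k requires every run to succeed, it is always at most pass@1, which is why the reported pass^4 scores sit well below the pass@1 scores for every model.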
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20592