Keywords: Model Context Protocol, MCP-use, Benchmark
TL;DR: We introduce LiveMCPBench, a benchmark of 95 real-world daily tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse tools.
Abstract: Model Context Protocol (MCP) has become a key infrastructure for connecting LLMs with external tools, scaling to 10,000+ MCP servers with diverse tools. Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: existing benchmarks typically assume single-server settings and inject tools directly into the model’s context, bypassing the challenges of large-scale retrieval and multi-tool composition. To bridge this gap, we propose **LiveMCPBench**, which evaluates 95 real-world daily tasks explicitly constructed to stress diverse tools and scaled multi-server routing. The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration. We further introduce an LLM-as-a-Judge evaluation framework that directly verifies task outcomes, handling dynamic data sources and multiple valid solution paths. We benchmark 10 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30–50%. Our analysis reveals that active tool composition strongly correlates with task success, whereas retrieval errors account for nearly half of all failures, highlighting retrieval as the dominant bottleneck. Together, these results provide the first large-scale, reproducible diagnosis of MCP agent capabilities and point towards future research on improving retrieval robustness and encouraging effective tool composition. Code and data will be released upon publication.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10721