Keywords: Model Context Protocol, MCP-use, Benchmark
TL;DR: We introduce LiveMCPBench, a benchmark of 95 real-world daily tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse tools.
Abstract: Model Context Protocol (MCP) has become a key infrastructure for connecting LLMs with external tools, scaling to 10,000+ MCP servers with diverse tools. Unfortunately, there is still a large gap between real-world MCP usage and current evaluation: existing benchmarks typically assume single-server settings and inject tools directly into the model’s context, bypassing the challenges of large-scale retrieval and multi-tool composition. To bridge this gap, we propose **LiveMCPBench**, which evaluates 95 real-world daily tasks explicitly constructed to stress diverse tools and scaled multi-server routing. The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration. We further introduce an LLM-as-a-Judge evaluation framework that directly verifies task outcomes, handling dynamic data sources and multiple valid solution paths. We benchmark 10 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30–50%. Our analysis reveals that active tool composition strongly correlates with task success, whereas retrieval errors account for nearly half of all failures, highlighting retrieval as the dominant bottleneck. Together, these results provide the first large-scale, reproducible diagnosis of MCP agent capabilities and point towards future research on improving retrieval robustness and encouraging effective tool composition. Code and data will be released upon publication.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10721