MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Ziyang Luo; Zhiqi Shen; Wenzhuo Yang; Zirui Zhao; Prathyusha Jwalapuram; Amrita Saha; Doyen Sahoo; Silvio Savarese; Caiming Xiong; Junnan Li

Published: 28 Sept 2025, Last Modified: 16 Oct 2025

Venue: SEA @ NeurIPS 2025 (Poster)

License: CC BY 4.0

Keywords: MCP, Agents, Benchmark

Abstract: The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) to external data sources and tools, and it is rapidly gaining adoption across major AI platforms. However, existing benchmarks are overly simplistic and fail to capture real-world application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs on realistic and difficult tasks through interaction with real-world MCP servers. Our benchmark spans 6 core domains and 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we carefully design execution-based evaluators, including format evaluators for compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically obtain real-time ground truth for temporally sensitive tasks. Through extensive evaluation of more than 20 leading LLMs, we find that even frontier models such as GPT-5-High (44.16% success rate) and Grok-4 (33.33% success rate) exhibit significant performance limitations. In addition, our benchmark poses a substantial long-context challenge, as the number of input tokens increases rapidly with each additional interaction step. It also introduces an unknown-tools challenge, since LLM agents often lack familiarity with the precise usage of certain MCP servers. Notably, enterprise-level agents such as Cursor and Claude Code fail to achieve better performance than the ReAct framework. Beyond evaluation, we open-source our extensible evaluation framework, enabling seamless integration of new LLMs, agents and MCP servers.
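The three evaluator types named in the abstract can be illustrated with a minimal Python sketch. This is not the benchmark's actual code: every function name, the JSON response shape, the `fetch_truth` callback, and the 1% tolerance are assumptions made for illustration only.

```python
import json
import math
from typing import Any, Callable


def format_evaluator(response: str, required_keys: set) -> bool:
    """Format check: the response must be a JSON object with all required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def static_evaluator(response: str, key: str, expected: Any) -> bool:
    """Static check: compare a field against a fixed, time-invariant answer."""
    data = json.loads(response)
    return data.get(key) == expected


def dynamic_evaluator(
    response: str,
    key: str,
    fetch_truth: Callable[[], float],  # hypothetical hook that queries a live source
    rel_tol: float = 0.01,             # assumed tolerance, not taken from the paper
) -> bool:
    """Dynamic check: obtain the ground truth at evaluation time, then compare."""
    data = json.loads(response)
    truth = fetch_truth()
    return math.isclose(float(data[key]), truth, rel_tol=rel_tol)


def evaluate_task(response: str, fetch_truth: Callable[[], float]) -> bool:
    """A task passes only if every evaluator passes (format is checked first)."""
    if not format_evaluator(response, {"ticker", "price"}):
        return False
    return (
        static_evaluator(response, "ticker", "ABC")
        and dynamic_evaluator(response, "price", fetch_truth)
    )
```

In this sketch the dynamic evaluator takes a callable rather than a stored value, so the reference answer for a time-sensitive question (a stock price, for example) is looked up when the task is scored rather than when it was written.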

Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.

Submission Number: 91
