Benchmarking Without Constructs: A Measurement Theory Critique of MCP Evaluation Frameworks

Published: 29 Apr 2026, Last Modified: 12 May 2026 · Eval Eval @ ACL 2026 Poster · Everyone · CC BY 4.0
Keywords: evaluation methodology, construct validity, agentic AI, tool use, Model Context Protocol, benchmarking, measurement theory, LLM agents
Abstract: The Model Context Protocol (MCP) has rapidly become a de facto standard for connecting large language models to external tools, prompting a wave of benchmarks (MCP-Bench, MCP-Universe, MCPMark, and others) aimed at evaluating agent competence in tool-use scenarios. We argue that this benchmark proliferation has outpaced construct definition: each framework implicitly encodes a different theory of what "MCP competence" means, yet none explicitly operationalizes the construct it purports to measure. Drawing on classical measurement theory, we analyze three prominent MCP benchmarks along the axes of construct validity, evaluation reliability, and reproducibility. We find that the field faces a measurement crisis analogous to early psychometrics: instruments are being built before the constructs they target have been agreed upon. We propose that evaluation researchers adopt explicit construct operationalization, multi-trait validation protocols, and a structured interdisciplinary consensus process before further benchmark proliferation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Provocation
Archival Status: Non-archival
Submission Number: 38