Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications

Published: 23 Sept 2025, Last Modified: 22 Nov 2025, Venue: LAW, License: CC BY 4.0
Keywords: Model Context Protocol, vision systems, tool orchestration, agent orchestration, schema drift, memory modeling, protocol audit, workflow coordination, security analysis, benchmark validators, multimodal agents, tool-use environments, schema validation, benchmarks, computer vision workflows, multi-agent systems, security
TL;DR: We audit 91 MCP vision servers, revealing schema drift, memory bugs, and insecure tool calls. We propose protocol extensions and release validators to detect failures, improving reliability and security in compositional vision workflows.
Abstract: The Model Context Protocol (MCP) defines a schema-bound execution model for agent-tool interaction, enabling modular computer vision workflows without retraining. To our knowledge, this is the first protocol-level, deployment-scale audit of MCP in vision systems, identifying systemic weaknesses in schema semantics, interoperability, and runtime coordination. We analyze 91 publicly registered vision-centric MCP servers, annotated along nine dimensions of compositional fidelity, and develop an executable benchmark with validators to detect and categorize protocol violations. The audit reveals a high prevalence of schema format divergence, missing runtime schema validation, undeclared coordinate conventions, and reliance on untracked bridging scripts. Validator-based testing quantifies these failures, with schema-format checks flagging misalignments in 78.0% of systems, coordinate-convention checks detecting spatial reference errors in 24.6%, and memory-scope checks issuing an average of 33.8 warnings per 100 executions. Security probes show that dynamic and multi-agent workflows exhibit elevated risks of privilege escalation and untyped tool connections. The proposed benchmark and validator suite, implemented in a controlled testbed, establish a reproducible framework for measuring and improving the reliability and security of compositional vision workflows.
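To illustrate the kind of check the abstract describes, the sketch below implements a minimal schema-format validator: it compares a tool's declared output schema against an actual payload and flags undeclared or missing keys. All names here (`check_schema_format`, the example detector schema) are hypothetical illustrations, not the paper's actual validator suite.

```python
def check_schema_format(declared_schema, payload):
    """Flag keys present in the payload but absent from the declared schema,
    and declared-required keys missing from the payload."""
    declared = set(declared_schema.get("properties", {}))
    required = set(declared_schema.get("required", []))
    actual = set(payload)
    return {
        "undeclared_keys": sorted(actual - declared),
        "missing_required": sorted(required - actual),
    }

# Hypothetical detector tool: it declares boxes and labels, but the
# returned payload adds an undeclared field and omits a required one --
# the schema-format divergence the audit reports.
schema = {
    "properties": {"boxes": {}, "labels": {}},
    "required": ["boxes", "labels"],
}
payload = {"boxes": [[0.1, 0.2, 0.4, 0.5]], "scores": [0.9]}

report = check_schema_format(schema, payload)
print(report)
# {'undeclared_keys': ['scores'], 'missing_required': ['labels']}
```

A production validator would also check value types and declared coordinate conventions (e.g., pixel vs. normalized, xyxy vs. xywh), which this sketch omits for brevity.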
Submission Type: Benchmark Paper (4-9 Pages)
Submission Number: 112