MetaSpace: Metamorphic Testing for Spatial Cognition in Embodied Agents
Abstract: An embodied agent is an intelligent entity that interacts with its environment through a physical body. Currently, the evaluation of embodied agents primarily relies on two paradigms: (1) manually annotated Visual Question Answering (VQA) pairs and (2) high-level task completion metrics, such as success in navigation or manipulation. The former is labor-intensive and subject to variability in annotation quality. The latter may obscure critical vulnerabilities: an agent can complete a task through suboptimal means or safety violations, so that safety risks and inefficiencies go undetected. Given that spatial cognition is the cornerstone of executing embodied tasks, there is a pressing need to assess whether embodied agents possess robust spatial cognition during task execution.
Inspired by metamorphic testing principles in software engineering, we propose MetaSpace, a novel framework designed to evaluate the spatial cognition of agents. By leveraging spatiotemporal multimodal states derived from real execution trajectories, MetaSpace automatically generates test cases based on predefined metamorphic relations (MRs) grounded in logical rules and physical laws. Crucially, we encode these MRs as executable rules in a logic programming language (Prolog). Violations of these relations indicate failures in spatial cognition. Our empirical evaluation across three embodied scenarios demonstrates that MetaSpace successfully detects 90,422 spatial cognition errors in state-of-the-art (SOTA) MLLM-driven agents. We introduce the Spatial Cognition (SC) score to quantify performance. Results indicate that all SOTA agents achieve average scores between 0.44 and 0.52, significantly lower than the human benchmark of 0.96. Additionally, these agents struggle with directional tasks, with SC scores consistently below 0.38. In contrast, their performance in magnitude-related tasks is relatively better, with most SC scores exceeding 0.5. To mitigate the identified spatial cognition errors, we explore potential improvement strategies. Preliminary results suggest that traditional prompting techniques (e.g., Chain of Thought) are limited, while spatially-aware prompting (e.g., cognitive maps) shows promise. Our findings underscore the importance of ongoing community efforts to enhance embodied agent performance by prioritizing the improvement of spatial cognition, a fundamental requirement for executing embodied tasks.
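To make the idea of a metamorphic relation concrete, the following is a minimal sketch of one plausible relation, written in Python for illustration rather than in the Prolog the framework uses. It assumes an "inverse direction" rule: if an agent reports that object A is left of object B, then a follow-up question with the roles swapped should yield "B is right of A". The relation names, the `Answer` type, and the helper functions are hypothetical and are not taken from MetaSpace; the abstract does not list the actual MRs or define how the SC score is computed, so neither is reproduced here.

```python
from dataclasses import dataclass

# Hypothetical inverse table (an assumption for illustration):
# if A is <relation> of B, then B must be <inverse relation> of A.
INVERSE = {
    "left_of": "right_of",
    "right_of": "left_of",
    "in_front_of": "behind",
    "behind": "in_front_of",
}


@dataclass(frozen=True)
class Answer:
    """An agent's answer: `subject` is `relation` with respect to `reference`."""
    subject: str
    relation: str
    reference: str


def violates_inverse_mr(source: Answer, follow_up: Answer) -> bool:
    """Return True if the follow-up answer contradicts the source answer.

    The follow-up question must swap subject and reference; no ground-truth
    label is needed, only consistency between the two answers.
    """
    if (follow_up.subject, follow_up.reference) != (source.reference, source.subject):
        raise ValueError("follow-up must swap subject and reference")
    return follow_up.relation != INVERSE[source.relation]


def count_violations(pairs) -> int:
    """Count answer pairs that break the inverse-direction relation."""
    return sum(violates_inverse_mr(src, fup) for src, fup in pairs)


# Two illustrative answer pairs (made-up data, not results from the paper).
pairs = [
    # Consistent: mug left of laptop, laptop right of mug.
    (Answer("mug", "left_of", "laptop"), Answer("laptop", "right_of", "mug")),
    # Inconsistent: both objects are claimed to be behind each other.
    (Answer("chair", "behind", "table"), Answer("table", "behind", "chair")),
]
violations = count_violations(pairs)
```

The point of the sketch is that the check needs no human-annotated ground truth: a violation is detected purely from the inconsistency between two answers, which is what allows test cases to be generated automatically from execution trajectories.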