ActionEQA: Action Interface for Embodied Question Answering

TMLR Paper 6540 Authors

17 Nov 2025 (modified: 19 Nov 2025) | Under review for TMLR | CC BY 4.0
Abstract: While Vision-Language Models (VLMs) are increasingly integral to embodied intelligence, a significant action-understanding bottleneck persists in translating high-level semantic instructions into precise low-level physical actions. Current benchmarks for embodied agents, however, focus primarily on high-level perception and planning and fail to capture the depth and nature of this semantic-to-physical gap. To address this, we introduce ActionEQA, the first Embodied Question Answering (EQA) benchmark designed to methodically evaluate the ability of VLMs to bridge this critical yet underexplored divide. Grounded in real-world robotics data, ActionEQA analyzes VLMs' grasp of the action interface along two axes: (1) a Three-Tiered Action Hierarchy that pinpoints the depth at which VLMs' action reasoning collapses, and (2) Bidirectional Reasoning Tasks that test whether VLMs struggle more to predict the outcomes of actions or to infer the actions that produced them. Our key findings are: (1) the primary bottleneck in action understanding occurs at the mid-level, arising from the challenge of grounding compositional language in 3D physical geometry; (2) VLMs are more adept at inferring past actions than predicting future outcomes; (3) richer visual inputs demand greater spatial reasoning from VLMs to map actions onto physical geometry; and (4) across the action hierarchy, model failures shift from predominantly perceptual errors at the high level to flawed geometric and physical reasoning at the low level.
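For concreteness, the sketch below illustrates how the two evaluation axes described in the abstract (action-hierarchy tier and reasoning direction) might be represented as a benchmark schema and aggregated into per-cell accuracies. All class, field, and function names are illustrative assumptions for exposition, not the paper's released data format or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ActionTier(Enum):
    # Hypothetical labels for the Three-Tiered Action Hierarchy.
    HIGH = "high"   # task-level semantic intent
    MID = "mid"     # compositional steps grounded in 3D geometry
    LOW = "low"     # precise low-level physical actions


class ReasoningDirection(Enum):
    # Hypothetical labels for the Bidirectional Reasoning Tasks.
    FORWARD = "forward"   # given an action, predict its outcome
    INVERSE = "inverse"   # given an outcome, infer the action that caused it


@dataclass
class ActionEQAItem:
    """One question, indexed by the two evaluation axes (assumed schema)."""
    question: str
    image_path: str               # real-world robotics observation
    tier: ActionTier              # position in the action hierarchy
    direction: ReasoningDirection
    answer: str


def accuracy_by_axis(items, predictions):
    """Aggregate accuracy per (tier, direction) cell to locate where reasoning collapses."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        key = (item.tier, item.direction)
        total[key] += 1
        correct[key] += int(pred.strip().lower() == item.answer.strip().lower())
    return {key: correct[key] / total[key] for key in total}
```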
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tim_Georg_Johann_Rudner1
Submission Number: 6540