EgoErrorVQA: Assess Egocentric Comprehension Capabilities Through Procedural Errors For Ego-Agentic AI
Keywords: Egocentric Vision, AI Agent, VQA, Procedural Comprehension.
Abstract: An increasing number of intelligent systems interact with daily human activities, making robust egocentric visual information processing essential. However, existing benchmarks for visual agents and Vision-Language Models (VLMs) primarily focus on third-person perspectives or capture only short-term visual understanding, limiting their ability to model long-horizon, action-centric procedures. To bridge this gap, we propose EgoErrorVQA, the first visual question answering (VQA) task designed for egocentric procedure comprehension with explicit modeling of procedural errors that reflect common execution failures in real-world tasks. EgoErrorVQA evaluates a range of models using both open-ended and multiple-choice questions, revealing persistent weaknesses in handling procedures with stepwise logical dependencies. In addition, we develop a user-friendly evaluator agent based on the Agent2Agent (A2A) protocol, enabling rigorous and standardized evaluation of visual agents through VQA-based interaction. Finally, we introduce EgoError-CoT, a training-free framework that improves reasoning through in-context learning and task-specific chain-of-thought prompting, delivering consistent gains without additional training.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic creation and evaluation of language resources, evaluation methodologies, evaluation, metrics.
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 3170