Keywords: Embodied AI, Failure, VLMs, Household
Abstract: Failures are inevitable when embodied agents execute complex tasks.
Vision-language models (VLMs) serve as the core component of embodied agents, enabling them to perceive the environment and make decisions.
Assessing the capabilities of VLMs in detecting and reasoning about failures has become increasingly important.
Previous work has primarily considered low-level manipulation failures (e.g., a 3 cm grasp offset), neglecting high-level failures that arise when embodied agents execute long-horizon tasks (e.g., dropping an object during the ``clean room'' task).
In this paper, we propose FAER, a failure-aware benchmark that evaluates VLMs on failure detection, failure categorization, failure description, and failure correction in long-horizon tasks.
FAER comprises 3,323 episodes, spanning 3 scenes, 65 tasks, and 83 objects.
We assess 16 widely used VLMs and 4 LLMs on the FAER tasks.
Experimental results show that nearly all VLMs, even GPT-4o, exhibit limited failure-detection performance with high false negative rates: they tend to overlook abnormal events, revealing notable gaps in current models' capacity to handle failures effectively.
The dataset and code are available at https://anonymous.4open.science/r/FAER-3C53.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision question answering, multimodality
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 6151