Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models

Published: 06 Sept 2025 · Last Modified: 26 Sept 2025 · CoRL 2025 Robot Data Workshop · CC BY 4.0
Keywords: Robotic Manipulation, Failure Detection, Vision-Language Models
TL;DR: A novel framework for detecting robotic planning and execution errors that leverages a unified VLM trained on newly generated robotic failure data, enabling accurate identification of both task planning and action execution errors.
Abstract: Robust robotic manipulation requires the ability to detect and recover from failures. While recent Vision-Language Models (VLMs) show promise in detecting manipulation failures, their effectiveness is hindered by limited training data and reliance on single images, restricting their ability to capture fine-grained failure modes and generalize to real-world scenarios. In this work, we introduce a new VLM named Guardian, which leverages high-resolution, multi-view visual observations combined with carefully designed language prompts to enhance manipulation failure detection. We propose an automatic failure synthesis pipeline that perturbs successful trajectories in both simulation and the real world to generate a diverse set of failure scenarios, covering both task planning and action execution errors. This enables the creation of two new benchmarks, RLBench-Fail and BridgeDataV2-Fail, for training and evaluation. Guardian achieves state-of-the-art performance on both benchmarks and demonstrates strong generalization to the manually created real-world RoboFail and UR5-Fail benchmarks. Furthermore, plugging Guardian into a state-of-the-art vision-language manipulation framework improves task success rates both in simulation and on real robots.
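The failure synthesis pipeline in the abstract can be pictured as a perturbation applied to successful demonstrations, at either the task-planning or the action-execution level. The sketch below is a minimal, hypothetical illustration of that idea, assuming a simple trajectory representation (a subtask plan plus end-effector waypoints); the class name, fields, and perturbation magnitudes are our own assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of failure synthesis by perturbing a successful trajectory.
# Not the paper's pipeline: representation and magnitudes are illustrative only.
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    subtasks: list          # high-level plan steps, e.g. ["reach", "grasp", "lift", "place"]
    waypoints: list         # end-effector positions per step, each [x, y, z]
    label: str = "success"


def synthesize_failure(traj, rng):
    """Perturb a successful trajectory into a labeled failure example."""
    subtasks = list(traj.subtasks)
    waypoints = [list(w) for w in traj.waypoints]
    if rng.random() < 0.5:
        # Task-planning error: corrupt the plan itself (skip or insert a step).
        if len(subtasks) > 1 and rng.random() < 0.5:
            subtasks.pop(rng.randrange(len(subtasks)))                       # skipped subtask
        else:
            subtasks.insert(rng.randrange(len(subtasks) + 1), "irrelevant")  # spurious subtask
        label = "planning_error"
    else:
        # Action-execution error: the plan is fine, but the motion drifts off target.
        start = rng.randrange(len(waypoints))
        offset = [rng.uniform(-0.05, 0.05) for _ in range(3)]
        for w in waypoints[start:]:
            for k in range(3):
                w[k] += offset[k]
        label = "execution_error"
    return Trajectory(subtasks=subtasks, waypoints=waypoints, label=label)


if __name__ == "__main__":
    rng = random.Random(0)
    demo = Trajectory(
        subtasks=["reach", "grasp", "lift", "place"],
        waypoints=[[0.3, 0.0, 0.10], [0.3, 0.0, 0.02], [0.3, 0.0, 0.20], [0.5, 0.1, 0.20]],
    )
    failed = synthesize_failure(demo, rng)
    print(failed.label, failed.subtasks)
```

Each perturbed trajectory keeps a label naming the error type it was generated with, which is what allows the resulting data to supervise a detector that distinguishes task planning errors from action execution errors.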
Supplementary Material: zip
Lightning Talk Video: mp4
Submission Number: 16