EVIL-SAFE: A Benchmark for Embodied Vision-Language Safety Inspection by Free Exploration in Home Environment
Keywords: Embodied Agent, Vision-Language Model, Home Safety Inspection, Benchmark
Abstract: Embodied agents can identify and report safety hazards in home settings. Accurately evaluating their ability to perform home safety checks is essential, yet current benchmarks have two major shortcomings. First, they oversimplify the task by using textual descriptions instead of visual inputs, hindering proper evaluation of vision-language model (VLM)-based agents. Second, they rely on a single static viewpoint, limiting exploration and potentially missing hazards that are occluded from fixed angles.
To address these issues, we introduce EVIL-SAFE, a benchmark with 12,900 instances covering five common home hazards. EVIL-SAFE provides dynamic first-person-view images from simulated home environments, allowing embodied agents to freely explore rooms and observe complex scenes from multiple perspectives, thereby enabling more comprehensive inspection. Our evaluation of mainstream VLMs on EVIL-SAFE reveals significant limitations: even the top model achieves only a 10.23% F1 score, struggling particularly with hazard recognition and exploration planning. We hope EVIL-SAFE will support future research on home safety inspection.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, vision question answering
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5841