EVIL-SAFE: A Benchmark for Embodied Vision-Language Safety Inspection by Free Exploration in Home Environment
Keywords: Embodied Agent, Vision-Language Model, Home Safety Inspection, Benchmark
Abstract: Embodied agents can identify and report safety hazards in home settings. Accurately evaluating their ability to perform home safety checks is essential, yet current benchmarks have two major shortcomings. First, they oversimplify the task by using textual descriptions instead of visual inputs, hindering proper evaluation of vision-language model (VLM)-based agents. Second, they rely on a single static viewpoint, limiting exploration and potentially missing hazards that are occluded from fixed angles.
To address these issues, we introduce EVIL-SAFE, a benchmark with 12,900 instances covering five common home hazards. EVIL-SAFE provides dynamic first-person-view images from simulated home environments, allowing embodied agents to freely explore rooms and observe complex scenes from multiple perspectives, thereby enabling more comprehensive inspection. Our evaluation of mainstream VLMs on EVIL-SAFE reveals significant limitations: even the top model achieves only a 10.23% F1 score, struggling particularly with hazard recognition and exploration planning. We hope EVIL-SAFE will support future research on home safety inspection.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, vision question answering
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5841