SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

ICLR 2026 Conference Submission16964 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: embodied llm agents, safety-aware task planning, safety-related benchmark

TL;DR: We introduce SafeAgentBench, the first comprehensive benchmark for safety-aware task planning of embodied LLM agents in interactive simulation, covering both explicit and implicit hazards. Our results reveal that current agents remain largely unsafe.

Abstract: With the integration of large language models (LLMs), embodied agents have strong capabilities to understand complicated natural language instructions and plan for them. However, a foreseeable issue is that these embodied agents can also flawlessly execute some hazardous tasks, potentially causing damage in the real world. Existing benchmarks predominantly overlook critical safety risks and focus solely on planning performance, while a few evaluate LLMs' safety awareness only on non-interactive image-text data. To address this gap, we present SafeAgentBench, the first comprehensive benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments, covering both explicit and implicit hazards. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that, although agents based on different design frameworks exhibit substantial differences in task success rates, their overall safety awareness remains weak. The most safety-conscious baseline achieves only a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing the LLM driving the agent does not lead to notable improvements in safety awareness. The dataset and code are available, as described in the reproducibility statement.

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Submission Number: 16964
