Guardian Angels in the Wild: Verification-First LLM Planning for Safety-Critical Daily Life Tasks

Published: 01 Mar 2026, Last Modified: 24 Apr 2026 · ICLR 2026 AIWILD · CC BY 4.0
Keywords: LLM agents, planning, safety-critical systems, verification, simulation, benchmarks, security, human-centered AI, LLM-as-a-judge
TL;DR: We introduce a verification-first LLM planning loop with deterministic simulator/validator checks for safety-critical, multi-task daily life planning.
Abstract: Large language models (LLMs) increasingly act as planners and agents, yet most evaluations remain confined to closed-world assumptions that break under dynamic real-world constraints. We study Guardian Angels: LLM agents that coordinate daily multi-task plans while issuing high-level commands to safety-critical physical devices such as autonomous vehicles and automated insulin delivery systems. In the wild, plausible plans can still be unsafe or physically infeasible due to interacting constraints, context drift, and tool brittleness. We introduce a 200-scenario benchmark spanning four domains (autonomous vehicles, automated insulin delivery, home multi-device planning, and meeting management), each paired with explicit dependencies, priorities, personalization requirements, replanning triggers, and deterministic simulators. We propose a verification-first agent loop in which the LLM emits a complete structured plan in JSON, a simulator/validator checks safety and feasibility prior to any execution, and unsafe plans trigger bounded repair or fail-safe escalation. Experiments with frontier LLMs show strong performance on Easy/Medium scenarios but sharp degradation on Hard multi-device settings. Finally, we audit automated evaluation and find that LLM-as-a-judge aligns with humans on easy tasks but systematically overestimates plan quality on hard safety-critical scenarios, motivating verifier-grounded evaluation and hybrid auditing for agents in the wild.
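The verification-first loop described in the abstract (LLM emits a complete JSON plan, a deterministic validator checks it before any execution, and failures trigger bounded repair or fail-safe escalation) can be sketched as follows. This is a minimal illustration only; all function names, the feedback format, and the repair budget are hypothetical placeholders, not the paper's implementation.

```python
import json

MAX_REPAIRS = 2  # bounded repair budget (illustrative value)

def verification_first_loop(llm_propose, validate, execute, escalate, task):
    """Sketch of a verification-first agent loop.

    llm_propose(task, feedback) -> JSON string (a complete structured plan)
    validate(plan)              -> (ok, feedback): deterministic safety/feasibility check
    execute(plan)               -> runs only plans the validator accepted
    escalate(task)              -> fail-safe hand-off when repair is exhausted
    """
    feedback = None
    for _ in range(MAX_REPAIRS + 1):
        # The LLM emits a full structured plan, optionally conditioned on
        # validator feedback from the previous failed attempt.
        plan = json.loads(llm_propose(task, feedback))
        # Simulator/validator checks safety and feasibility BEFORE execution.
        ok, feedback = validate(plan)
        if ok:
            return execute(plan)  # only verified plans reach the device
    # Unsafe after bounded repair: escalate rather than act.
    return escalate(task)
```

With stub components, a plan rejected once and repaired on the second attempt is executed, while a plan that never validates ends in escalation.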
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 161