Guarded Tool-Using LLM Agents for Incident Response: A Safety-Gated Architecture and Operational Evaluation Protocol

Published: 01 Mar 2026, Last Modified: 24 Apr 2026 | ICLR 2026 AIWILD | Everyone | CC BY 4.0
Keywords: LLM agents, tool use, agent safety, prompt injection, incident response, SRE, safety gating, benchmarking, operational evaluation, audit traces
TL;DR: A safety-gated architecture and SRE-grounded evaluation scaffold for tool-using incident-response agents, with reproducible measurements of safety–utility tradeoffs under prompt injection.
Abstract: Tool-using large language model (LLM) agents show promise for automating on-call incident response (triage, diagnosis, and remediation), but incident response stresses agent systems in ways that standard benchmarks typically ignore: actions have high impact, observability is partial, and attackers can inject instructions into tickets, logs, or tool outputs to steer an agent toward unsafe tool calls. This workshop paper presents two contributions. (1) The Guarded Incident Response Agent (GIRA), an architecture that separates proposal generation from action authorization via a multi-layer safety gate that can block, rewrite (e.g., to a dry run), or escalate tool calls. (2) The Operational Evaluation Protocol (OEP), a benchmark-style evaluation scaffold that measures safety and utility using SRE-grounded metrics (time-to-detect, time-to-mitigate, blast radius) together with injection success rate (ISR) and unauthorized action rate. We evaluate both in a controlled, seeded simulator, using a reproducible mock planner to isolate gate behavior. Across five incidents, four injection classes, nine agent configurations, and 2,250 runs, an unguarded agent frequently executes injected commands (ISR 0.640, blast radius 0.736). The GIRA Full configuration prevents every observed injection-driven execution of policy-violating tool calls in this setting (ISR 0.000) and substantially reduces blast radius (0.145), at the cost of slower mitigation (4.7 vs. 2.4 steps). We also find that gates built from policy checks and schema validation alone still permit tool-output injection (ISR 0.040), motivating injection-aware gating beyond typed tools.
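The separation of proposal generation from action authorization described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the gate's layers, so the layer names (`policy_layer`, `injection_layer`, `impact_layer`), the tool names, and the `provenance` tag used to flag instructions that originate from untrusted text are all hypothetical assumptions chosen for illustration. The sketch only shows the general shape of a multi-layer gate that can allow, block, rewrite to a dry run, or escalate a proposed tool call.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Optional, Sequence


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REWRITE = "rewrite"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict
    # Hypothetical tag: where the instruction behind this proposal came from.
    provenance: str = "planner"
    dry_run: bool = False


@dataclass(frozen=True)
class Verdict:
    decision: Decision
    call: Optional[ToolCall]  # the call to execute or review, if any
    reason: str


# Hypothetical tool inventory; a real deployment would load this from policy.
ALLOWED_TOOLS = {"get_logs", "restart_service", "delete_volume"}
DISRUPTIVE_TOOLS = {"restart_service"}
DESTRUCTIVE_TOOLS = {"delete_volume"}

Layer = Callable[[ToolCall], Optional[Verdict]]


def policy_layer(call: ToolCall) -> Optional[Verdict]:
    """Block any tool that is not on the allowlist."""
    if call.tool not in ALLOWED_TOOLS:
        return Verdict(Decision.BLOCK, None, "tool not in allowlist")
    return None


def injection_layer(call: ToolCall) -> Optional[Verdict]:
    """Block proposals traced to untrusted text (assumed provenance tagging)."""
    if call.provenance != "planner":
        return Verdict(Decision.BLOCK, None, f"untrusted source: {call.provenance}")
    return None


def impact_layer(call: ToolCall) -> Optional[Verdict]:
    """Escalate destructive calls; downgrade disruptive ones to a dry run."""
    if call.tool in DESTRUCTIVE_TOOLS:
        return Verdict(Decision.ESCALATE, call, "needs human approval")
    if call.tool in DISRUPTIVE_TOOLS and not call.dry_run:
        return Verdict(Decision.REWRITE, replace(call, dry_run=True), "forced dry run")
    return None


def gate(
    call: ToolCall,
    layers: Sequence[Layer] = (policy_layer, injection_layer, impact_layer),
) -> Verdict:
    """Run the layers in order; the first layer that objects decides."""
    for layer in layers:
        verdict = layer(call)
        if verdict is not None:
            return verdict
    return Verdict(Decision.ALLOW, call, "passed all layers")


if __name__ == "__main__":
    proposals = [
        ToolCall("get_logs", {"service": "api"}),
        ToolCall("restart_service", {"service": "api"}),
        ToolCall("delete_volume", {"volume": "db-data"}),
        ToolCall("restart_service", {"service": "api"}, provenance="tool_output"),
    ]
    for p in proposals:
        v = gate(p)
        print(p.tool, p.provenance, "->", v.decision.value, "|", v.reason)
```

In this sketch the planner never executes anything itself: it only produces `ToolCall` proposals, and only a call returned inside an `ALLOW` or `REWRITE` verdict would be passed on to an executor. Dropping `injection_layer` from `layers` leaves a gate that relies on policy and impact checks alone, which is the kind of configuration the abstract reports as still admitting tool-output injection.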
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 240