What Can One Bad Tool Call Destroy? Measuring and Minimizing Blast Radius in Agentic Tool Use

Published: 23 May 2026, Last Modified: 23 May 2026 | Venue: ICML 2026 AIWILD | License: CC BY 4.0
Keywords: agent safety, agent security, tool use, prompt injection, MCP, benchmark, blast radius, provenance, least privilege, rollback
TL;DR: DamageFlow measures not just whether tool-using agents are compromised, but how much operational damage one bad tool call can cause.
Abstract: Tool-using agents are usually evaluated by whether they complete benign tasks or succumb to malicious instructions. For deployment, this binary view is incomplete. The same attack-success label can hide radically different operational outcomes: a compromised agent might draft an incorrect message, leak sensitive data, delete repository files, mutate customer records, or propagate a malicious instruction to another agent. We introduce DamageFlow, a benchmark and evaluation framework for measuring the blast radius of tool-using agents under indirect prompt injection, MCP tool poisoning, tool shadowing, and multi-agent relay attacks. DamageFlow's current artifact implements sandboxed email, documents/RAG, repository, and MCP-style registry environments, specifies calendar, CRM, and filesystem extensions, and uses deterministic state-diff oracles that score committed tool effects rather than natural-language explanations. Its metric, Blast Radius Score (BRS), decomposes unauthorized tool effects by resource sensitivity, action impact, propagation, and recovery cost. We further propose BlastGuard, a provenance-aware runtime that compiles trusted user intent into least-privilege capability graphs, tracks untrusted context flow, and gates state-changing tool calls through dry-run, policy checks, scoped tokens, and rollback logs. In a controlled 80-scenario pilot benchmark, binary attack success obscures large differences in realized damage: two agents with the same 20% attack-success rate differ by 89x in BRS. BlastGuard reduces committed unauthorized damage in the pilot while retaining 87.5% benign task utility, and counterfactual scoring reports the BRS that was prevented at the tool boundary. Our results suggest that agent safety evaluation should measure not only whether agents fail, but how far failure can spread.
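
To make the distinction between binary attack success and blast radius concrete, the following is a minimal sketch of how a BRS-style score could be computed from a state diff. It is not the paper's definition: the abstract states only that BRS decomposes unauthorized effects by resource sensitivity, action impact, propagation, and recovery cost. The `Effect` schema, the multiplicative combination, the `[0, 1]` scales, and all numeric values below are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Effect:
    """One committed tool effect recovered from a state diff (hypothetical schema)."""
    resource: str
    authorized: bool      # True if the trusted user intent covers this effect
    sensitivity: float    # assumed scale 0..1: how sensitive the resource is
    impact: float         # assumed scale 0..1: how severe the action is
    propagation: int      # assumed: count of downstream agents/resources reached
    recovery_cost: float  # assumed scale 0..1: 0 means trivially rolled back


def effect_score(e: Effect) -> float:
    """Score one effect; authorized effects contribute nothing.

    The multiplicative form is an assumption, not the paper's formula.
    """
    if e.authorized:
        return 0.0
    return e.sensitivity * e.impact * (1 + e.propagation) * (1 + e.recovery_cost)


def blast_radius_score(effects: Iterable[Effect]) -> float:
    """Sum the scores of all unauthorized committed effects."""
    return sum(effect_score(e) for e in effects)


def attack_succeeded(effects: Iterable[Effect]) -> bool:
    """Binary label: did any unauthorized effect get committed?"""
    return any(not e.authorized for e in effects)


# Two illustrative runs, both labeled "attack succeeded".
agent_a = [Effect("email:draft", False, 0.1, 0.1, 0, 0.0)]   # incorrect draft
agent_b = [Effect("repo:src/", False, 0.9, 0.8, 1, 0.5)]     # deleted files, relayed once

brs_a = blast_radius_score(agent_a)
brs_b = blast_radius_score(agent_b)
print(attack_succeeded(agent_a), attack_succeeded(agent_b))
print(brs_b > brs_a)
```

Under these assumed weights, both runs receive the same binary label while their scores differ by orders of magnitude, which is the gap the abstract argues a binary metric hides. Scoring committed effects from the diff, rather than the agent's own explanation, keeps the oracle deterministic.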
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 190