Keywords: indirect prompt injections, llm agent security, firewall defenses
Abstract: Tool-calling LLM agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. We propose a simple, modular, model-agnostic defense based on two firewalls at the agent–tool boundary: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, our firewalls make minimal assumptions on the agent and can be deployed out-of-the-box, while maintaining strong performance. Our approach achieves either 0\% or the lowest attack success rate across four public benchmarks AgentDojo, InjecAgent, Tau-Bench, and Agent Security Bench, while maintaining high utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics and implementation bugs that prevent agents from achieving full utility under attack. To foster more meaningful progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench while proposing best-practices for more robust benchmark design.
Submission Number: 68
Loading