Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar; Kevin Kasa; Abhay Puri; Gabriel Huang; Irina Rish; Graham W. Taylor; Krishnamurthy Dj Dvijotham; Alexandre Lacoste

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste

Published: 27 Oct 2025, Last Modified: 27 Oct 2025NeurIPS Lock-LLM Workshop 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: indirect prompt injections, llm agent security, firewall defenses

Abstract: Tool-calling LLM agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. We propose a simple, modular, model-agnostic defense based on two firewalls at the agent–tool boundary: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, our firewalls make minimal assumptions on the agent and can be deployed out-of-the-box, while maintaining strong performance. Our approach achieves either 0\% or the lowest attack success rate across four public benchmarks AgentDojo, InjecAgent, Tau-Bench, and Agent Security Bench, while maintaining high utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics and implementation bugs that prevent agents from achieving full utility under attack. To foster more meaningful progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench while proposing best-practices for more robust benchmark design.

Submission Number: 68

Loading