CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
Keywords: Indirect Prompt Injection Defense, AI Agent Security
TL;DR: novel and practical defense for indirect prompt injection using precise tool output sanitization
Abstract: The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Current defenses struggle with context-dependent attacks and cannot reliably differentiate malicious from benign instructions, leading to high false positive rates that prevent real-world adoption. To address this, we present a novel approach inspired by a fundamental principle of computer security: data should not contain executable instructions. Instead of classifying tool outputs as malicious or benign—a hard, low-resource, and ill-defined objective—we reframe the sanitization task entirely. We propose a _token-level sanitization process_ that surgically removes _any instructions directed at AI systems_ from untrusted tool outputs, capturing malicious instructions as a byproduct without needing to judge maliciousness. While this over-approximation may appear restrictive, our experiments show that it does not harm agent utility. In contrast to existing safety classifiers, this approach is non-blocking, requires no calibration, and is context-agnostic. Further, we train such token-level predictors using only readily available instruction-tuning data, without relying on unrealistic prompt injection examples. Across a wide range of attacks and benchmarks, including AgentDojo, BIPIA, InjecAgent, ASB, and SEP, our method achieves a 7–19× reduction in attack success rate (from 34% to 3% on AgentDojo) without impairing agent utility in either benign or malicious settings.
PDF: pdf
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 67
Loading