CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das; Luca Beurer-Kellner; Marc Fischer; Maximilian Baader

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader

Published: 01 Mar 2026, Last Modified: 24 Apr 2026ICLR 2026 AIWILDEveryoneRevisionsCC BY 4.0

Keywords: Indirect Prompt Injection Defense, AI Agent Security

TL;DR: novel and practical defense for indirect prompt injection using precise tool output sanitization

Abstract: The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Current defenses struggle with context-dependent attacks and cannot reliably differentiate malicious from benign instructions, leading to high false positive rates that prevent real-world adoption. To address this, we present a novel approach inspired by a fundamental principle of computer security: data should not contain executable instructions. Instead of classifying tool outputs as malicious or benign—a hard, low-resource, and ill-defined objective—we reframe the sanitization task entirely. We propose a _token-level sanitization process_ that surgically removes _any instructions directed at AI systems_ from untrusted tool outputs, capturing malicious instructions as a byproduct without needing to judge maliciousness. While this over-approximation may appear restrictive, our experiments show that it does not harm agent utility. In contrast to existing safety classifiers, this approach is non-blocking, requires no calibration, and is context-agnostic. Further, we train such token-level predictors using only readily available instruction-tuning data, without relying on unrealistic prompt injection examples. Across a wide range of attacks and benchmarks, including AgentDojo, BIPIA, InjecAgent, ASB, and SEP, our method achieves a 7–19× reduction in attack success rate (from 34% to 3% on AgentDojo) without impairing agent utility in either benign or malicious settings.

PDF: pdf

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 67

Loading