Soft Instruction De-escalation Defense

Soft Instruction De-escalation Defense

ICLR 2026 Conference Submission22590 Authors

20 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Prompt Injections, defense, tool-use

Abstract: Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control) - a simple yet effective multi-stage sanitization pipeline designed for tool-augmented LLM agents. Our approach begins by unconditionally rewriting incoming data to neutralize any potential instructions by masking, rephrasing, or removing them. To detect attacks against the rewriter itself, we inject known canary instructions before this process; if these instructions survive, we conclude the rewrite was compromised. To account for the imprecision of LLMs, we apply multiple independent rewrite passes. Finally, a detection module inspects the full text and smaller chunks of the output for any residual instruction-like content. If imperative instructions remain, the agent halts to ensure security. This defense-in-depth strategy, combining unconditional rewriting, canary checking, and chunk-based detection, makes successful attacks significantly more difficult than bypassing a single detection model.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 22590

Loading