Keywords: Prompt Injections, defense, tool-use
Abstract: Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control) - a simple yet effective multi-stage sanitization pipeline designed for tool-augmented LLM agents. Our approach begins by unconditionally rewriting incoming data to neutralize any potential instructions by masking, rephrasing, or removing them. To detect attacks against the rewriter itself, we inject known canary instructions before this process; if these instructions survive, we conclude the rewrite was compromised. To account for the imprecision of LLMs, we apply multiple independent rewrite passes. Finally, a detection module inspects the full text and smaller chunks of the output for any residual instruction-like content. If imperative instructions remain, the agent halts to ensure security. This defense-in-depth strategy, combining unconditional rewriting, canary checking, and chunk-based detection, makes successful attacks significantly more difficult than bypassing a single detection model.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22590
Loading