Keywords: Agent, Security, Prompt Injection
Abstract: LLM agents are inherently vulnerable to *prompt injection attacks*. A simple and easy-to-deploy baseline defense is to directly prompt an *off-the-shelf* LLM to detect injected prompts; however, prior work has shown this approach to be largely ineffective. Importantly, these results were based on older LLMs with weaker reasoning capabilities. In this work, we revisit this idea in light of the strong reasoning capabilities of modern LLMs. The results show that, with a carefully designed system prompt, our PromptArmor can accurately *detect* and *remove* injected prompts by directly prompting a modern LLM. For example, PromptArmor using GPT-4o achieves both a *false positive rate* and a *false negative rate* below 1% on the AgentDojo benchmark, and below 5% on Open Prompt Injection and TensorTrust. We further evaluate PromptArmor against adaptive attacks and investigate alternative prompting strategies. Overall, our work shows that the previous conception of this approach as ineffective is no longer the case, and that prompting a strong, off-the-shelf LLM should now be regarded as a standard baseline for evaluating defenses against prompt injection.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4931
Loading