Keywords: agents; prompt injection; robustness
TL;DR: We analyze the structure of agents and its implications for prompt injection attacks. We create prompt injection attacks that match common usage patterns, and demonstrate that they work on production systems.
Abstract: Instruction-following LLM assistants that read untrusted data are susceptible to prompt injection, in which a malicious actor injects a harmful request that the assistant naively complies with, to the user's detriment. We analyze the structure of tool-using LLM agents to create a descriptive framework for prompt injection attacks. Examining this framework, we find that certain attack modalities are understudied, and we observe important trends in attack performance as we vary both how prompt injection attacks are introduced and their token budget, from which we draw practical takeaways. Importantly, previous work has not significantly explored the dimension of time; we make the key finding that, after being prompt-injected, many agents can behave benignly for 50+ conversation turns before taking a malicious action. Finally, we validate our work by executing sandboxed attacks against deployed systems such as Claude Code and Gemini-CLI. Our attacks readily succeed, and additionally reveal as-yet undocumented emergent behavior in these models' responses to prompt injection.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24651