Keywords: LLM safety, LLM agents, jailbreak
TL;DR: A single adversarial string can propagate among agents: starting from one compromised agent, it ultimately compromises all agents in the collection.
Abstract: LLM-powered agents augmented with memory, retrieval, and the ability to call external tools have demonstrated significant potential for improving human productivity. However, the fact that these models are vulnerable to adversarial attacks and other forms of "jailbreaking" raises concerns about safety and misuse, particularly when agents are granted autonomy. We initiate the study of these vulnerabilities in multi-agent, multi-round settings, where a collection of LLM-powered agents repeatedly exchange messages to complete a task. Focusing on the case where a single agent is initially exposed to an adversarial input, we aim to understand when this can lead to the eventual compromise of *all* agents in the collection via transmission of adversarial strings in subsequent messages. We show that this requires finding an initial self-propagating input that induces agents to repeat it with high probability relative to the contents of their memory, i.e., one that *generalizes* well across contexts. We propose a new attack called Generalizable Infectious Gradient Attack (GIGA) and show that it succeeds across varied experimental settings that aim to **1)** propagate an attack suffix across large collections of models and **2)** bypass a prompt-rewriting defense for adversarial examples, whereas existing attack methods often struggle to find such self-propagating inputs.
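To illustrate why the repeat probability (i.e., how well the input generalizes across memory contexts) governs whether the whole collection is eventually compromised, here is a minimal sketch of a toy infection model. It is not the paper's experimental protocol: the function name `simulate_propagation`, the parameters `num_agents`, `rounds`, and `repeat_prob`, and the uniform random-pairing message schedule are all illustrative assumptions.

```python
import random


def simulate_propagation(num_agents=20, rounds=10, repeat_prob=0.9, seed=0):
    """Toy infection model: each round, every agent messages a random peer.

    A compromised sender includes the adversarial string with probability
    `repeat_prob` (standing in for how well the string generalizes across
    whatever else is in the sender's memory); the receiver is then compromised.
    Returns the number of compromised agents after each round.
    """
    rng = random.Random(seed)
    compromised = [False] * num_agents
    compromised[0] = True  # a single agent is initially exposed
    history = [sum(compromised)]
    for _ in range(rounds):
        for sender in range(num_agents):
            receiver = rng.randrange(num_agents)
            if receiver == sender:
                continue
            if compromised[sender] and rng.random() < repeat_prob:
                compromised[receiver] = True
        history.append(sum(compromised))
    return history


if __name__ == "__main__":
    # A higher repeat probability (better generalization across contexts)
    # leads to faster and more complete compromise of the collection.
    for p in (0.3, 0.6, 0.9):
        print(f"repeat_prob={p}: compromised per round -> "
              f"{simulate_propagation(repeat_prob=p)}")
```

Under these assumptions, low repeat probabilities tend to leave the infection partial or stalled, while high repeat probabilities compromise nearly all agents within a few rounds, which is the regime the proposed attack aims to reach by optimizing the input to generalize across contexts.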
Serve As Reviewer: wyu3@andrew.cmu.edu
Submission Number: 34