Adaptive Adversaries: A Multi-Turn, Multi-LLM Benchmark for LLM Agent Security
Keywords: Safety benchmark, LLM agents, security, multi-turn jailbreaking, prompt injection, red-teaming, adversarial robustness, attack success rate
Abstract: LLM-based agents process external content, exposing them to prompt injection and multi-turn manipulation. Most safety benchmarks evaluate defenders against fixed pools of single-turn or multi-turn attacks collected before evaluation. We present a 21-scenario benchmark for adaptive multi-round attacks against memoryless LLM defenders: an autonomous LLM attacker observes prior defender responses and pivots across rounds, while each defender response is evaluated as a fresh interaction. Holding the 21 scenarios, attackers, defenders, and structured-output scoring fixed, restricting scoring to the first attacker turn yields $0$--$1%$ attack success rate (ASR), whereas allowing 15 rounds of adaptive attack yields $8.3$--$19.0%$. Pooling three frontier attacker LLMs uncovers up to $2.4\times$ as many unique successful attacks as the best single attacker, and the generated attacks have low cosine similarity ($0.02$--$0.14$) to attacks in existing benchmarks. Claude Opus 4.6 and GPT-5.4 are equivalent in aggregate ($8.3%$ vs. $11.1%$ ASR), but their weaknesses differ sharply: on one scenario Opus reaches $60%$ ASR ($95%$ CI $36$--$80%$) while GPT-5.4 and Gemini each stay at $7%$ (CI $1$--$30%$). Of the $21$ scenarios, $15$ distinguish at least one defender pair, yet rankings disagree across scenarios (Kendall's $W = 0.21$). We release the benchmark (21 evaluation scenarios, 10 public development scenarios, the orchestrator, baseline harnesses, and a multi-attacker CLI) plus 945 transcripts from the $3\times3$ frontier matrix, an attack-replay dataset, and $18{,}422$ battles from an open competition's final scoring rounds.
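The abstract does not state how the per-scenario confidence intervals were computed or how many battles underlie them. As a rough sanity check only, the sketch below computes a Wilson score interval under the assumption of 15 battles per defender per scenario (945 transcripts divided by 21 scenarios and 3 defenders). Both the interval method and the counts 9/15 and 1/15 are assumptions chosen for illustration, not figures taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Assumed counts: 9/15 (60% ASR) and 1/15 (about 7% ASR); n=15 is a guess.
ASSUMED_TRIALS = 15
for assumed_successes in (9, 1):
    low, high = wilson_interval(assumed_successes, ASSUMED_TRIALS)
    asr = assumed_successes / ASSUMED_TRIALS
    print(f"ASR {asr:.0%}: 95% CI {low:.0%} to {high:.0%}")
```

Under these assumed counts the intervals round to 36 to 80% and 1 to 30%, which agrees with the ranges quoted in the abstract, though this does not confirm the authors' actual method.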
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 223