AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 oral · CC BY 4.0
TL;DR: We introduce a benchmark that measures the ability of LLMs to automatically exploit adversarial example defenses, and show that current LLMs struggle at this real-world task.
Abstract: We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, it breaks just 37% of real-world defenses, indicating a large gap in difficulty between attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4's 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
Lay Summary: We introduce AutoAdvExBench, a benchmark designed to evaluate whether large language models (LLMs) can autonomously identify and exploit vulnerabilities in adversarial defense systems. Adversarial defenses protect machine learning systems from specially crafted inputs meant to fool them; for example, someone might subtly modify an image so that an image classifier misidentifies it. Our benchmark tests whether LLMs can find ways around these protective measures and automate work that human security researchers currently do manually. Our findings reveal an interesting pattern. When we tested our strongest combination of LLM agents on CTF-like challenges ("Capture The Flag" challenges are practice exercises, similar to homework problems, that security professionals use for training), the LLMs successfully broke 87% of the defenses. However, when we moved to real-world defense systems that researchers actually wrote for research papers, the best LLMs' success rate dropped to just 37%. This gap highlights how much harder it is to attack real systems than educational exercises. We also discovered that performance doesn't transfer predictably between these two settings. For example, we tested two advanced language models, Claude Sonnet 3.5 and Opus 4 (where Opus 4 is the more advanced model). On the CTF-like challenges, both performed similarly: Sonnet 3.5 broke 75% of defenses while Opus 4 broke 79%. But on real-world defenses, the difference was dramatic: Sonnet 3.5 succeeded only 13% of the time, while Opus 4 managed 30%. The benchmark is publicly available at https://github.com/ethz-spylab/AutoAdvExBench for other researchers to use and build upon.
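To make the underlying task concrete, below is a minimal sketch (not taken from the benchmark or the paper) of how an adversarial example is typically constructed against an undefended image classifier, using the single-step fast gradient sign method; the model choice, tensor shapes, and epsilon value are illustrative assumptions. The defenses in AutoAdvExBench are designed to resist perturbations of roughly this kind, and the benchmark asks whether an LLM agent can write attack code that circumvents each defense's specific mechanism.

```python
# Illustrative sketch only: a basic FGSM adversarial perturbation against an
# off-the-shelf classifier. Real AutoAdvExBench tasks require adapting the
# attack to each defense's code, which is the hard part being benchmarked.
import torch
import torch.nn.functional as F
import torchvision.models as models

# A pretrained classifier stands in for the victim model (assumption).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def fgsm_attack(image: torch.Tensor, label: torch.Tensor, epsilon: float = 8 / 255) -> torch.Tensor:
    """Return a perturbed copy of `image` intended to change the model's prediction.

    `image` is a (1, 3, H, W) tensor with values in [0, 1]; `label` is a (1,)
    tensor holding the true class index.
    """
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, then clip back to valid pixels.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Example usage (shapes are assumptions):
#   x = torch.rand(1, 3, 224, 224)        # an input image in [0, 1]
#   y = torch.tensor([207])               # its true class index
#   x_adv = fgsm_attack(x, y)
#   print(model(x).argmax(1), model(x_adv).argmax(1))
```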
Link To Code: https://github.com/ethz-spylab/AutoAdvExBench
Primary Area: Social Aspects->Security
Keywords: benchmark, adversarial examples, agents
Submission Number: 7217