TL;DR: We introduce a cybersecurity benchmark for evaluating the capability of AI agents to exploit real-world web application vulnerabilities.
Abstract: Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture-the-Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities requires both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable attacks. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures (CVEs). In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our experiments show that the state-of-the-art agent framework can exploit up to 13% of the vulnerabilities.
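To make the sandboxed evaluation concrete, below is a minimal sketch of what an evaluation loop in such a framework could look like. It is illustrative only: the container layout, the `agent.run` interface, and the grader endpoint are assumptions made for this sketch, not CVE-Bench's actual API (see the linked repository for the real implementation).

```python
# Hypothetical sketch of a CVE-Bench-style evaluation loop. The compose file,
# the agent interface, and the exploit oracle are illustrative assumptions.
import subprocess

import requests


def start_target(compose_file: str) -> None:
    """Launch the vulnerable web application in an isolated container sandbox."""
    subprocess.run(["docker", "compose", "-f", compose_file, "up", "-d"], check=True)


def exploit_succeeded(oracle_url: str) -> bool:
    """Query a grader endpoint that checks whether the attack landed, e.g.,
    whether the database was tampered with or an admin account was taken over."""
    return requests.get(oracle_url, timeout=5).json().get("exploited", False)


def evaluate(agent, task: dict) -> bool:
    """Run one agent against one vulnerable application and grade the outcome."""
    start_target(task["compose_file"])
    try:
        # The agent receives only the target URL and a high-level goal,
        # mimicking a real attacker's starting conditions.
        agent.run(target=task["target_url"], goal=task["attack_goal"])
        return exploit_succeeded(task["oracle_url"])
    finally:
        # Tear down the sandbox so runs stay independent and repeatable.
        subprocess.run(
            ["docker", "compose", "-f", task["compose_file"], "down", "-v"],
            check=True,
        )
```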
Lay Summary: Modern artificial intelligence systems (e.g., ChatGPT) are increasingly able to write code and even hack computer programs on their own. As a result, security researchers need a reliable way to see how dangerous these AI-powered hackers really are. Unfortunately, current benchmarks usually rely on toy games or cover only a limited number of cybersecurity vulnerabilities. They do not reflect the complexity and diversity of real-world web applications.
We introduce CVE-Bench, a benchmark built from real-world web applications. Each application contains a critical-severity vulnerability that actually occurred in practice. We developed a secure sandbox framework for these applications to simulate real-world attack scenarios and evaluate AI systems automatically.
In our experiments, we find that state-of-the-art AI systems succeed up to 13% of the time. This means current AI systems can exploit some real vulnerabilities, but they still miss most of them. CVE-Bench gives researchers a realistic, repeatable way to track that risk and to improve defenses before AI hackers grow more effective.
Link To Code: https://github.com/uiuc-kang-lab/cve-bench
Primary Area: Applications->Everything Else
Keywords: benchmark, cybersecurity, llm, agent
Submission Number: 5058