TL;DR: We introduce a cybersecurity benchmark for evaluating the capability of AI agents to exploit real-world web application vulnerabilities.
Abstract: Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture-the-Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities requires both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable attacks. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures (CVEs). In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our experiments show that the state-of-the-art agent framework can exploit up to 13% of the vulnerabilities.
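To make the sandboxed evaluation concrete, below is a minimal sketch of what an evaluation loop in such a framework could look like. It is illustrative only: the container layout, the `agent.run` interface, and the grader endpoint are assumptions made for this sketch, not CVE-Bench's actual API (see the linked repository for the real implementation).

```python
# Hypothetical sketch of a CVE-Bench-style evaluation loop. The compose file,
# the agent interface, and the exploit oracle are illustrative assumptions.
import subprocess

import requests


def start_target(compose_file: str) -> None:
    """Launch the vulnerable web application in an isolated container sandbox."""
    subprocess.run(["docker", "compose", "-f", compose_file, "up", "-d"], check=True)


def exploit_succeeded(oracle_url: str) -> bool:
    """Query a grader endpoint that checks whether the attack landed, e.g.,
    whether the database was tampered with or an admin account was taken over."""
    return requests.get(oracle_url, timeout=5).json().get("exploited", False)


def evaluate(agent, task: dict) -> bool:
    """Run one agent against one vulnerable application and grade the outcome."""
    start_target(task["compose_file"])
    try:
        # The agent receives only the target URL and a high-level goal,
        # mimicking a real attacker's starting conditions.
        agent.run(target=task["target_url"], goal=task["attack_goal"])
        return exploit_succeeded(task["oracle_url"])
    finally:
        # Tear down the sandbox so runs stay independent and repeatable.
        subprocess.run(
            ["docker", "compose", "-f", task["compose_file"], "down", "-v"],
            check=True,
        )
```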
Lay Summary: Modern artificial intelligence systems (e.g., ChatGPT) are increasingly able to write code and even hack computer programs on their own. As a result, security researchers need a reliable way to see how dangerous these AI-powered hackers really are. Unfortunately, current benchmarks usually rely on toy games or cover only a limited number of cybersecurity vulnerabilities. They do not reflect the complexity and diversity of real-world web applications.
We introduce CVE-Bench, a benchmark built from real-world web applications. Each application contains a critical-severity vulnerability that actually occurred in practice. We developed a secure sandbox framework for these applications to simulate real-world attack scenarios and evaluate AI systems automatically.
In our experiments, we find that state-of-the-art AI systems succeed up to 13% of the time. This means current AI systems can exploit some real vulnerabilities, but they still miss most of them. CVE-Bench gives researchers a realistic, repeatable way to track that risk and to improve defenses before AI hackers grow more effective.
Link To Code: https://github.com/uiuc-kang-lab/cve-bench
Primary Area: Applications->Everything Else
Keywords: benchmark, cybersecurity, llm, agent
Submission Number: 5058