Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Andy K Zhang; Neil Perry; Riya Dulepet; Joey Ji; Celeste Menders; Justin W Lin; Eliot Jones; Gashon Hussein; Samantha Liu; Donovan Julian Jasper; Pura Peetathawatchai; Ari Glenn; Vikram Sivashankar; Daniel Zamoshchin; Leo Glikbarg; Derek Askaryar; Haoxiang Yang; Aolin Zhang; Rishi Alluri; Nathan Tran; Rinnara Sangpisit; Kenny O Oseleononmen; Dan Boneh; Daniel E. Ho; Percy Liang

Published: 22 Jan 2025, Last Modified: 11 Feb 2025. Venue: ICLR 2025 Oral. License: CC BY 4.0

Keywords: Language Model Agents, Benchmark, Cybersecurity, Risk

TL;DR: Cybench is a cybersecurity agent benchmark with 40 professional-level Capture the Flag tasks that are recent, meaningful, and difficult, each broken down into subtasks.

Abstract: Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps for a more detailed evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top-performing models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance across 4 agent scaffolds (structured bash, action-only, pseudoterminal, and web search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve. In comparison, the most difficult task took human teams 24 hours and 54 minutes to solve. Anonymized code and data are available at https://drive.google.com/file/d/1kp3H0pw1WMAH-Qyyn9WA0ZKmEa7Cr4D4 and https://drive.google.com/file/d/1BcTQ02BBR0m5LYTiK-tQmIK17_TxijIy.
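
To make the task structure described in the abstract concrete, the following is a minimal sketch of how a task with a description, starter files, and subtasks might be represented and scored. It is an illustration only, not Cybench's actual schema or metric: the class names (`Task`, `Subtask`), the field names, and the helpers `subtask_fraction` and `solved` are all hypothetical, and exact string matching is an assumed grading rule.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One intermediary step of a task (hypothetical structure)."""
    question: str  # intermediary question posed to the agent
    answer: str    # expected answer for this step


@dataclass
class Task:
    """A CTF-style task (hypothetical structure, not the Cybench schema)."""
    name: str
    description: str
    starter_files: list[str] = field(default_factory=list)
    subtasks: list[Subtask] = field(default_factory=list)
    flag: str = ""  # final answer for the complete task


def subtask_fraction(task: Task, agent_answers: list[str]) -> float:
    """Fraction of subtasks answered correctly (assumed partial-credit score)."""
    if not task.subtasks:
        return 0.0
    correct = sum(
        1
        for sub, ans in zip(task.subtasks, agent_answers)
        if ans.strip() == sub.answer
    )
    return correct / len(task.subtasks)


def solved(task: Task, submitted_flag: str) -> bool:
    """True if the agent's final submission matches the flag exactly."""
    return submitted_flag.strip() == task.flag


# Toy data, invented purely for illustration.
toy = Task(
    name="toy-task",
    description="Recover the flag from the provided script.",
    starter_files=["chall.py"],
    subtasks=[
        Subtask("Which file contains the key?", "chall.py"),
        Subtask("What is the flag?", "FLAG{example}"),
    ],
    flag="FLAG{example}",
)
print(subtask_fraction(toy, ["chall.py", "wrong guess"]))
print(solved(toy, "FLAG{example}"))
```

Separating per-subtask credit from whole-task success mirrors the abstract's distinction between subtask-based evaluation and solving a complete task without subtask guidance.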

Primary Area: datasets and benchmarks

Submission Number: 5074
