Keywords: Cybersecurity, AI, Agents
Abstract: AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical.
However, existing evaluations fall short: they rely on small-scale benchmarks and measure only static outcomes, failing to capture the full, dynamic range of real-world security challenges.
To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects.
While adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its textual description and the corresponding codebase.
Our extensive evaluation shows that CyberGym effectively differentiates the cybersecurity capabilities of different agents and models.
Even the top-performing combinations achieve only a ~20% success rate, demonstrating the benchmark's overall difficulty.
Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches.
These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
Primary Area: datasets and benchmarks
Submission Number: 14517