SecTest-Eval: Can LLMs Verify Security Impacts of a Vulnerability?

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models (LLMs), Security Impact, CIA Triad, SecTestCase, Vulnerable Functions
Abstract: As Large Language Models (LLMs) have demonstrated capabilities in exploiting software vulnerabilities, the potential misuse of LLMs in conducting cyberattacks highlights the urgent need for benchmarks that capture the frontier of their capabilities. Existing benchmarks primarily evaluate LLMs from a global perspective, where LLMs are tasked to generate exploits that reach vulnerable code (e.g., a function) from project entry points, and they reveal significant performance gaps. Consequently, recent studies have explored decomposing the challenging end-to-end exploit generation task into a series of relatively simple tasks, applying LLMs from a local perspective, particularly for generating exploits that directly call vulnerable functions. While such attempts have shown effectiveness, existing benchmarks may yield unreliable model performance in these scenarios due to low label accuracy for vulnerable functions. To address this, we introduce SecTest-Eval, the first benchmark for evaluating LLMs in exploit generation from a local perspective, where LLMs are tasked to generate exploits that directly call vulnerable functions. SecTest-Eval incorporates a novel automated data labeling method that achieves accurate vulnerable-function annotation, and features a sandbox framework that automatically evaluates generated exploits by monitoring unauthorized data access, data modification, and denial-of-service. Our evaluations show that, even from a local perspective, current LLMs still face challenges in exploit generation, achieving at most a 56% success rate. Furthermore, we find that chain-of-thought prompting yields no significant improvement, while integrating LLMs into security-oriented agents improves success rates by 7.5%. These findings underscore the effectiveness of SecTest-Eval and suggest that enhancing LLMs' capabilities in exploit generation requires either training on specialized datasets or incorporating security-specific tools.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 9060