SecTest-Eval: Can LLMs Verify Security Impacts of a Vulnerability?

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models (LLMs), Security Impact, CIA Triad, SecTestCase, Vulnerable Functions
Abstract: As Large Language Models (LLMs) have demonstrated capabilities in exploiting software vulnerabilities, the potential misuse of LLMs in conducting cyberattacks highlights the urgent need for benchmarks that capture the frontier of their capabilities. Existing benchmarks primarily evaluate LLMs from a global perspective, where LLMs are tasked to generate exploits that reach vulnerable code (e.g., a function) from project entry points, and they reveal significant performance gaps. Consequently, recent studies have explored decomposing the challenging end-to-end exploit generation task into a series of relatively simple tasks, applying LLMs from a local perspective, particularly for generating exploits that directly call vulnerable functions. While such attempts have shown effectiveness, existing benchmarks may yield unreliable model performance in these scenarios due to low label accuracy for vulnerable functions. To address this, we introduce SecTest-Eval, the first benchmark for evaluating LLMs in exploit generation from a local perspective, where LLMs are tasked to generate exploits that directly call vulnerable functions. SecTest-Eval incorporates a novel automated data labeling method that achieves accurate vulnerable-function annotation, and features a sandbox framework that automatically evaluates generated exploits by monitoring unauthorized data access, data modification, and denial-of-service. Our evaluations show that, even from a local perspective, current LLMs still face challenges in exploit generation, achieving at most a 56% success rate. Furthermore, we find that chain-of-thought prompting yields no significant improvement, while integrating LLMs into security-oriented agents improves success rates by 7.5%. These findings underscore the effectiveness of SecTest-Eval and suggest that enhancing LLMs' capabilities in exploit generation requires either training on specialized datasets or incorporating security-specific tools.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 9060