SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI

Published: 18 Sept 2025, Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track poster · CC BY 4.0
Keywords: Code Generation, AI Security, Computer Security
TL;DR: We create a unified benchmark for evaluating secure code generation, vulnerability detection, and PoC generation.
Abstract: Existing benchmarks for evaluating the security risks and capabilities (e.g., vulnerability detection) of code-generating large language models (LLMs) face several key limitations: (1) limited coverage of risks and capabilities; (2) reliance on static evaluation metrics such as LLM judgments or rule-based detection, which lack the precision of dynamic analysis; and (3) a trade-off between data quality and benchmark scale. To address these challenges, we introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Each mutated sample retains the seed's security semantics while providing diverse, unseen instances. The resulting benchmark bundles every artifact required for dynamic evaluation, including prompts, vulnerable and patched code, test cases, and ground-truth proofs of concept, enabling rigorous measurement of insecure coding, vulnerability detection, and patch generation. Applying this framework to Python, C/C++, and Java, we build SECODEPLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities. Compared with state-of-the-art benchmarks, SECODEPLT offers broader coverage, higher data fidelity, and substantially greater scale. We use SECODEPLT to evaluate leading code-generation LLMs and agents, revealing their strengths and weaknesses in both generating secure code and identifying or fixing vulnerabilities. We provide our code at \url{https://github.com/ucsb-mlsec/SeCodePLT} and our data at \url{https://huggingface.co/datasets/UCSB-SURFI/SeCodePLT}.
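To make the per-sample structure described in the abstract concrete, below is a minimal sketch of what one evaluation record might look like. The class name, field names, and the `is_secure` helper are hypothetical illustrations of the artifacts the abstract enumerates (prompt, vulnerable and patched code, test cases, ground-truth PoC, CWE category); they are not taken from the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-sample schema mirroring the artifacts listed in the abstract.
@dataclass
class SeCodePLTSample:
    cwe_id: str                  # e.g. "CWE-78" (OS command injection)
    language: str                # "python", "c_cpp", or "java"
    prompt: str                  # task description given to the model
    vulnerable_code: str         # seed or mutated insecure implementation
    patched_code: str            # corresponding secure implementation
    test_cases: List[str] = field(default_factory=list)  # functional tests
    ground_truth_poc: str = ""   # input that triggers the vulnerability


def is_secure(candidate_code: str, sample: SeCodePLTSample) -> bool:
    """Placeholder for the dynamic check implied by the abstract: a candidate
    passes if it satisfies the functional tests and the ground-truth PoC no
    longer triggers the bug. The actual SeCodePLT harness executes code in a
    controlled environment; this stub only illustrates the evaluation contract.
    """
    raise NotImplementedError
```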
Croissant File: json
Dataset URL: https://huggingface.co/datasets/UCSB-SURFI/SeCodePLT
Code URL: https://github.com/ucsb-mlsec/SeCodePLT
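As a usage sketch (not from the paper), the released data can presumably be pulled from the Hugging Face Hub with the `datasets` library. The snippet below assumes a single default configuration; if the dataset defines multiple configurations, `load_dataset` will require a config name, so check the dataset card for the actual layout and field names.

```python
from datasets import load_dataset

# Load SeCodePLT from the Hugging Face Hub and inspect its structure.
# Split and column names are discovered at runtime rather than assumed.
ds = load_dataset("UCSB-SURFI/SeCodePLT")
print(ds)                      # shows the splits actually present
first_split = next(iter(ds))
print(ds[first_split][0])      # inspect one record's fields
```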
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1511