Keywords: Code Security, Benchmark, Granularity, Robustness
Abstract: We introduce $\textbf{ZeroSecBench}$, a benchmark for fine-grained and robust evaluation of secure code generation in LLM-based AI copilots. Existing benchmarks are limited by $\textit{coarse-grained evaluation}$ that relies only on CWE categories---obscuring component- and scenario-specific risks---and by $\textit{insufficient robustness}$ due to homogeneous, simplified samples.
ZeroSecBench contributes: (1) a $\textit{three-axis vulnerability taxonomy}$ that couples CWE with affected component and vulnerability scenario to enable component-aware analysis; and (2) a $\textit{robustness-oriented construction pipeline}$ with five augmentations (mask-position variation, unsafe-code distractors, grammatical traps, contextual noise, and leakage control). The benchmark contains 850 vulnerability instances mined from 150,000 real-world GitHub repositories, covering 12 CWEs and 46 Java components, with paired $\textit{autocomplete}$ and $\textit{instruct}$ settings. We further provide a hybrid evaluation pipeline that combines syntax and functionality checks with LLM-as-judge security voting and dynamic proof-of-concept execution.
Across 11 state-of-the-art models, the best overall pass@1 is 0.26, and performance varies substantially across components even within the same CWE (e.g., SSRF components range from 0.10 to 1.00), underscoring the need for component-aware assessment. Compared with 13 prior benchmarks, ZeroSecBench achieves the highest quality score across ten design dimensions. ZeroSecBench thus establishes a rigorous foundation for measuring and advancing secure code generation in AI copilots.
Primary Area: datasets and benchmarks
Submission Number: 24130