Keywords: Code Security, Benchmark, Granularity, Robustness
Abstract: We introduce $\textbf{ZeroSecBench}$, a benchmark for fine-grained and robust evaluation of secure code generation in LLM-based AI copilots. Existing benchmarks are limited by $\textit{coarse-grained evaluation}$ that relies only on CWE categories---obscuring component- and scenario-specific risks---and by $\textit{insufficient robustness}$ due to homogeneous, simplified samples.
ZeroSecBench contributes: (1) a $\textit{three-axis vulnerability taxonomy}$ that couples CWE with affected component and vulnerability scenario to enable component-aware analysis; and (2) a $\textit{robustness-oriented construction pipeline}$ with five augmentations (mask-position variation, unsafe-code distractors, grammatical traps, contextual noise, and leakage control). The benchmark contains 850 vulnerability instances mined from 150,000 real-world GitHub repositories, covering 12 CWEs and 46 Java components, with paired $\textit{autocomplete}$ and $\textit{instruct}$ settings. We further provide a hybrid evaluation pipeline that combines syntax and functionality checks with LLM-as-judge security voting and dynamic proof-of-concept execution.
Across 11 state-of-the-art models, the best overall pass@1 is 0.26, and performance varies substantially across components even within the same CWE (e.g., SSRF components range from 0.10 to 1.00), underscoring the need for component-aware assessment. Compared with 13 prior benchmarks, ZeroSecBench achieves the highest quality score across ten design dimensions. ZeroSecBench thus establishes a rigorous foundation for measuring and advancing secure code generation in AI copilots.
Primary Area: datasets and benchmarks
Submission Number: 24130