Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

ACL ARR 2026 January Submission6137 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: multi-vulnerability detection, long-context code understanding, software vulnerability detection, multi-label classification, robustness evaluation, count bias, selection bias, vulnerability detection benchmark
Abstract: Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software, where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from "count bias" and "selection bias" in multi-label tasks, yet these biases have not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k–10k tokens) sourced from CodeParrot. We evaluate five state-of-the-art LLMs, including GPT-4o-mini, Llama-3.3-70B, and the Qwen-2.5 series. Our results reveal a sharp degradation in performance as vulnerability density increases. While Llama-3.3-70B achieves near-perfect F1 scores (~0.97) on single-vulnerability C tasks, performance drops by up to 40% in high-density settings. Notably, Python and JavaScript show distinct failure modes compared to C/C++, with models exhibiting severe "under-counting" (recall dropping below 0.30) in complex Python files.
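To make the evaluation protocol concrete, the sketch below shows one plausible reading of the abstract's multi-label scoring: each file carries a ground-truth set of injected vulnerability types, the model returns a predicted set, and precision, recall, and F1 are micro-averaged over all files. This is a minimal sketch under stated assumptions; the function names, the CWE-style labels, and the micro-averaged aggregation are illustrative choices, not the authors' released code.

```python
# Minimal sketch of set-based multi-label scoring for multi-vulnerability
# detection. Assumptions (not from the paper): each file is labeled with a
# set of injected CWE-style vulnerability types, the model emits a predicted
# set per file, and scores are micro-averaged over the whole benchmark.

def micro_f1(gold_sets, pred_sets):
    """Micro-averaged precision/recall/F1 over per-file label sets."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # correctly detected vulnerability types
        fp += len(pred - gold)   # spurious detections (over-counting)
        fn += len(gold - pred)   # missed vulnerabilities (under-counting)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: a 3-vulnerability file where the model under-counts,
# mirroring the low-recall failure mode described in the abstract.
gold = [{"CWE-79", "CWE-89", "CWE-22"}]
pred = [{"CWE-79"}]
print(micro_f1(gold, pred))  # precision 1.0, recall ~0.33, F1 0.5
```

Micro-averaging is one reasonable aggregation here; per-density breakdowns (1, 3, 5, or 9 injected vulnerabilities) would simply apply the same routine to the corresponding subsets of files.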
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: vulnerability detection, evaluation of code models, code understanding, multi-language code models, repository modeling, safety and reliability of code models, code reasoning
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 6137