Keywords: Large Language Models, Vulnerability Detection, Benchmark, Count Bias, Software Security, Evaluation Metrics
TL;DR: We introduce MultiVulnBench, a large-scale benchmark revealing that LLMs suffer from a universal "count bias" where they systematically under-report software vulnerabilities as bug density increases.
Abstract: Large Language Models (LLMs) achieve near-perfect performance on
single-vulnerability detection yet suffer a systematic, underexplored
failure when files contain multiple co-located vulnerabilities:
recall collapses as vulnerability density grows, a phenomenon we term
$\textbf{count bias}$.
Existing benchmarks frame detection as binary classification of individual
functions and cannot expose this failure mode.
We present $\textbf{MultiVulnBench}$, the first large-scale benchmark designed to
measure count bias in LLM-based vulnerability detection.
MultiVulnBench comprises $\textbf{20,000 files}$ across four languages
(Python, C, C++, JavaScript) at five controlled density levels
($N \in \{0,1,3,5,9\}$ vulnerabilities per file), on which we evaluate five
state-of-the-art LLMs under zero-shot prompting.
We introduce the $\textbf{ExactFile}$ metric, the fraction of files for which the
model correctly identifies every vulnerability, which captures
complete-audit accuracy better than $F_1$ alone.
Our central finding is that count bias is both universal and
catastrophic: at $N=9$, ExactFile accuracy falls to single-digit percentages for
every model and language, regardless of model size or family.
Mistral-3.2-24B achieves $F_1=0.974$ with $95.8\%$ ExactFile on
JavaScript at $N=1$; by $N=9$ this collapses to $F_1=0.577$ (a 41% relative drop)
with ExactFile of $5.2\%$, meaning the model produces a complete, correct
audit for only about 1 file in 20.
All five models share the same failure signature: precision stays near $1.0$
while recall collapses, indicating systematic under-prediction rather than
misclassification.
Count error, measured by Mean Absolute Error on predicted vulnerability counts,
grows monotonically with $N$ for all models.
We additionally expose a dataset composition pathology, CWE homogeneity at
specific density levels, that inflates apparent performance and must be
controlled in future benchmark design.
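As a rough illustration of the three quantities the abstract reports, the sketch below computes ExactFile, micro-averaged precision/recall/$F_1$, and count MAE for a toy pair of files. It is not the authors' evaluation code. It assumes each file's findings are represented as a set of hashable labels (here hypothetical `(cwe_id, line)` tuples), that "identifies all vulnerabilities correctly" means the predicted set equals the ground-truth set, and that scores are micro-averaged over files. The function names `exact_file`, `micro_prf`, and `count_mae` are made up for this example.

```python
from typing import Hashable, Sequence, Set, Tuple

# Hypothetical representation: one set of findings per file,
# e.g. {("CWE-89", 12), ("CWE-79", 40)}. An empty set models an N=0 file.
Findings = Set[Hashable]


def exact_file(gold: Sequence[Findings], pred: Sequence[Findings]) -> float:
    """Fraction of files whose predicted set equals the ground-truth set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same files")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_prf(gold: Sequence[Findings], pred: Sequence[Findings]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all files."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly reported
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious reports
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed vulnerabilities
    # Assumed convention: an empty denominator scores 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def count_mae(gold: Sequence[Findings], pred: Sequence[Findings]) -> float:
    """Mean absolute error between predicted and true vulnerability counts."""
    return sum(abs(len(g) - len(p)) for g, p in zip(gold, pred)) / len(gold)


# Toy data: file 0 has three vulnerabilities but only one is reported;
# file 1 has one vulnerability and it is reported.
gold = [{("CWE-89", 12), ("CWE-79", 40), ("CWE-22", 71)}, {("CWE-78", 5)}]
pred = [{("CWE-89", 12)}, {("CWE-78", 5)}]

ef = exact_file(gold, pred)
precision, recall, f1 = micro_prf(gold, pred)
mae = count_mae(gold, pred)
print(ef, precision, recall, round(f1, 3), mae)
```

In this toy case every reported finding is correct, so precision is 1.0, yet half of the true vulnerabilities are missed and only one of the two files is fully audited. This is the same under-prediction pattern the abstract describes, at a much smaller scale.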
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 43