Keywords: Vulnerability Benchmark, Large Language Model, CVE, RAG
TL;DR: A comprehensive benchmark dataset for vulnerability identification and assessment.
Abstract: With over 20,000 Common Vulnerabilities and Exposures (CVEs) reported annually, software vulnerabilities represent a critical cybersecurity challenge requiring automated assessment tools. While large language models (LLMs) show promise as cybersecurity assistants, existing benchmarks exhibit fundamental limitations: narrow data sources, neglect of contextual information, and a focus on single-turn tasks rather than multi-turn analyst workflows. To bridge this gap, we introduce DiagVuln, the first multi-turn conversational benchmark for LLM-based vulnerability assessment. DiagVuln comprises 2,000 CVEs across 23 question-and-answer categories, encompassing detection, localization, classification, root cause analysis, exploit reasoning, impact assessment, and patching. We construct high-quality QA pairs in DiagVuln using retrieval-augmented generation (RAG) over data collected from diverse sources, validated through LLM-as-a-Judge and conformal prediction based on human expert annotations. Evaluating five state-of-the-art LLMs on DiagVuln reveals substantial limitations, with low top-1 accuracy (below 60%) in vulnerable code detection and CVE identification. Our evaluation also demonstrates that current models lack the critical reasoning capabilities required for reliable vulnerability assessment. DiagVuln provides a valuable resource for advancing research in evaluating and fine-tuning LLMs for vulnerability assessment.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8186