DiagVuln: A Holistic Conversational Benchmark for Evaluating LLMs on Vulnerability Assessment

ICLR 2026 Conference Submission 8186 Authors

17 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vulnerability Benchmark, Large Language Model, CVE, RAG
TL;DR: A comprehensive benchmark dataset for vulnerability identification and assessment.
Abstract: With over 20,000 Common Vulnerabilities and Exposures (CVEs) reported annually, software vulnerabilities represent a critical cybersecurity challenge requiring automated assessment tools. While large language models (LLMs) show promise as cybersecurity assistants, existing benchmarks exhibit fundamental limitations: narrow data sources, neglect of contextual information, and a focus on single-turn tasks rather than multi-turn analyst workflows. To bridge this gap, we introduce DiagVuln, the first multi-turn conversational benchmark for LLM-based vulnerability assessment. DiagVuln comprises 2,000 CVEs across 23 question-and-answer categories, encompassing detection, localization, classification, root cause analysis, exploit reasoning, impact assessment, and patch analysis. We construct high-quality QA pairs in DiagVuln using retrieval-augmented generation (RAG) based on data collected from diverse sources, validated through LLM-as-a-Judge and conformal prediction based on human expert annotations. Evaluation of five state-of-the-art LLMs using DiagVuln reveals substantial limitations, with low top-1 accuracy (below 60%) in vulnerable code detection and CVE identification. Our evaluation also demonstrates that current models lack critical reasoning capabilities for reliable vulnerability assessment. DiagVuln provides a valuable resource for advancing research in evaluating and fine-tuning LLMs for vulnerability assessment.
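The abstract does not detail how the conformal-prediction step validates LLM-as-a-Judge outputs, so the following is only a minimal sketch of one standard way such a filter could be calibrated: judge scores on a human-annotated calibration set are used to pick a score threshold that keeps the error rate of accepted QA pairs below a target level. All names (calibrate_threshold, accept_qa_pair, judge scores, alpha) are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def calibrate_threshold(judge_scores, human_labels, alpha=0.1):
    """Split conformal calibration of an LLM-as-a-Judge acceptance threshold.

    judge_scores : judge confidence scores in [0, 1] on a human-annotated
                   calibration set of generated QA pairs.
    human_labels : 1 if the expert marked the pair as valid, else 0.
    alpha        : target miscoverage rate, e.g. 0.1 for 90% coverage.
    """
    scores = np.asarray(judge_scores, dtype=float)
    labels = np.asarray(human_labels, dtype=int)

    # Nonconformity of the truly valid calibration pairs: a high judge score
    # on a valid pair means low nonconformity.
    nonconformity = 1.0 - scores[labels == 1]
    n = len(nonconformity)

    # Conformal quantile with the usual finite-sample correction.
    q_level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return np.quantile(nonconformity, q_level, method="higher")

def accept_qa_pair(judge_score, threshold):
    """Retain a generated QA pair only if its nonconformity is within threshold."""
    return (1.0 - judge_score) <= threshold

# Hypothetical usage on a small calibration set.
cal_scores = [0.95, 0.90, 0.88, 0.70, 0.60, 0.92, 0.85, 0.40]
cal_labels = [1,    1,    1,    0,    0,    1,    1,    0]
tau = calibrate_threshold(cal_scores, cal_labels, alpha=0.1)
print(accept_qa_pair(0.93, tau))  # True -> pair would be kept in the benchmark
```

Under this reading, the human annotations supply the calibration labels and the conformal step gives a distribution-free guarantee on how often accepted QA pairs are actually invalid; the paper itself should be consulted for the exact construction.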
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8186