Keywords: Vulnerability Detection, Formal Verification, Synthetic Datasets
TL;DR: ApproxVul is a dataset and framework that uses Large Language Models to mutate real-world code snippets into complete, verifiable programs, creating a large-scale, realistic dataset to overcome the data quality limitations of existing benchmarks.
Abstract: The advancement of machine learning for vulnerability detection is critically hampered by the absence of datasets that are simultaneously large-scale, accurately labeled, and realistic. Existing benchmarks impose a trade-off: large, real-world datasets suffer from noisy labels and contamination, while manually curated datasets are too small to train robust models. Synthetic datasets, although formally verifiable, typically lack the structural complexity of production code, resulting in a significant "realism gap". To overcome these limitations, we introduce ApproxVul, a framework and dataset that unites the realism of real-world code with the mathematical certainty of formal verification. Our framework leverages Large Language Models (LLMs) to systematically mutate code snippets from real-world projects, introducing a diverse range of subtle and complex vulnerabilities. Each resulting program is then formally verified to establish ground-truth labels, eliminating label noise. This process yields ApproxVul, a new dataset of over 104,000 compilable and verifiable programs, featuring minimally different vulnerable/safe pairs derived from real-world code. Through comprehensive evaluation, we demonstrate that ApproxVul achieves better cross-dataset generalization than purely synthetic training datasets and slightly outperforms noisy real-world training data. While fine-tuning alone remains insufficient for project-level generalization, ApproxVul's inter-procedural and verifiable structure makes it a crucial stepping stone toward more advanced vulnerability detection approaches.
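The abstract describes a mutate-then-verify pipeline: an LLM injects a subtle vulnerability into a real-world snippet and completes it into a full program, and a formal verifier then decides the ground-truth label. The following is a minimal, hypothetical sketch of that flow; the function names (`mutate_with_llm`, `formally_verify`), the data shapes, and the dummy return values are illustrative assumptions, not the ApproxVul API.

```python
# Hypothetical sketch of the mutate-then-verify pipeline (not the ApproxVul API).
from dataclasses import dataclass


@dataclass
class LabeledProgram:
    source: str        # complete, compilable program text
    vulnerable: bool   # ground-truth label established by verification


def mutate_with_llm(snippet: str) -> str:
    """Placeholder for an LLM call that injects a subtle vulnerability
    into a real-world snippet and completes it into a full program."""
    # A real pipeline would prompt an LLM here; the sketch just echoes
    # the snippet so it runs end to end.
    return snippet


def formally_verify(program: str) -> bool:
    """Placeholder for a formal-verification step that decides whether
    the program violates a safety property (True = vulnerable)."""
    return False  # dummy verdict for the sketch


def build_pair(snippet: str) -> tuple[LabeledProgram, LabeledProgram]:
    """Produce a minimally different safe/vulnerable pair from one snippet."""
    safe = LabeledProgram(source=snippet, vulnerable=formally_verify(snippet))
    mutated = mutate_with_llm(snippet)
    vuln = LabeledProgram(source=mutated, vulnerable=formally_verify(mutated))
    return safe, vuln


if __name__ == "__main__":
    safe, vuln = build_pair("int main(void) { return 0; }")
    print(safe.vulnerable, vuln.vulnerable)
```

The key design point the sketch illustrates is that labels come from the verifier's verdict rather than from the LLM's intent, which is what eliminates label noise in the resulting pairs.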
Primary Area: datasets and benchmarks
Submission Number: 17482