Keywords: Vulnerability Detection, Formal Verification, Synthetic Datasets
TL;DR: ApproxVul is a dataset and framework that uses Large Language Models to mutate real-world code snippets into complete, verifiable programs, creating a large-scale, realistic dataset to overcome the data quality limitations of existing benchmarks.
Abstract: The advancement of machine learning for vulnerability detection is critically hampered by the absence of datasets that are simultaneously large-scale, accurately labeled, and realistic. Existing benchmarks impose a trade-off: large, real-world datasets suffer from noisy labels and contamination, while manually curated datasets are too small to train robust models. Synthetic datasets, although formally verifiable, typically lack the structural complexity of production code, resulting in a significant "realism gap". To overcome these limitations, we introduce ApproxVul, a framework and dataset that unites the realism of real-world code with the mathematical certainty of formal verification. Our framework leverages Large Language Models (LLMs) to systematically mutate code snippets from real-world projects, introducing a diverse range of subtle and complex vulnerabilities. Each resulting program is then formally verified to establish ground-truth labels, eliminating label noise. This process yields ApproxVul, a new dataset of over 104,000 compilable and verifiable programs, featuring minimally different vulnerable/safe pairs derived from real-world code. Through comprehensive evaluation, we demonstrate that ApproxVul achieves better cross-dataset generalization than purely synthetic training datasets and slightly outperforms noisy real-world training data. While fine-tuning alone remains insufficient for project-level generalization, ApproxVul's inter-procedural and verifiable structure makes it a crucial stepping stone toward more advanced vulnerability detection approaches.
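The abstract describes a mutate-then-verify pipeline: an LLM injects a subtle vulnerability into a real-world snippet and completes it into a full program, and a formal verifier then decides the ground-truth label. The following is a minimal, hypothetical sketch of that flow; the function names (`mutate_with_llm`, `formally_verify`), the data shapes, and the dummy return values are illustrative assumptions, not the ApproxVul API.

```python
# Hypothetical sketch of the mutate-then-verify pipeline (not the ApproxVul API).
from dataclasses import dataclass


@dataclass
class LabeledProgram:
    source: str        # complete, compilable program text
    vulnerable: bool   # ground-truth label established by verification


def mutate_with_llm(snippet: str) -> str:
    """Placeholder for an LLM call that injects a subtle vulnerability
    into a real-world snippet and completes it into a full program."""
    # A real pipeline would prompt an LLM here; the sketch just echoes
    # the snippet so it runs end to end.
    return snippet


def formally_verify(program: str) -> bool:
    """Placeholder for a formal-verification step that decides whether
    the program violates a safety property (True = vulnerable)."""
    return False  # dummy verdict for the sketch


def build_pair(snippet: str) -> tuple[LabeledProgram, LabeledProgram]:
    """Produce a minimally different safe/vulnerable pair from one snippet."""
    safe = LabeledProgram(source=snippet, vulnerable=formally_verify(snippet))
    mutated = mutate_with_llm(snippet)
    vuln = LabeledProgram(source=mutated, vulnerable=formally_verify(mutated))
    return safe, vuln


if __name__ == "__main__":
    safe, vuln = build_pair("int main(void) { return 0; }")
    print(safe.vulnerable, vuln.vulnerable)
```

The key design point the sketch illustrates is that labels come from the verifier's verdict rather than from the LLM's intent, which is what eliminates label noise in the resulting pairs.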
Primary Area: datasets and benchmarks
Submission Number: 17482