GitPatchDB: A Large-Scale GitHub Commit Databank for Vulnerability Patch Analysis

ICLR 2026 Conference Submission 13341 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Vulnerability Databases, Vulnerability Patch Search, Patch Commits, Program Slicing, Contrastive Representation Learning, Software Security
TL;DR: We propose GitPatchDB, a semantically rich dataset for vulnerability patch search with program slicing techniques, and Contrastive Natural-language Programming-language Pre-training (CNPP) that leverages it to achieve state-of-the-art performance.
Abstract: Machine-learning-based vulnerability detection relies on datasets that link vulnerabilities to their corresponding patches. However, existing resources such as Common Vulnerabilities and Exposures (CVE) often lack reliable patch references: many CVE entries do not provide patch commits, and a significant share of existing commits have become inaccessible due to code repository changes. To bridge this gap and better facilitate vulnerability detection, we curate GitPatchDB, a large-scale, semantically rich dataset that pairs CVEs with their corresponding patch commits, where each commit is represented not only as code diffs but also as interprocedural program slices generated through program slicing and related program analysis techniques. To leverage this dataset, we further propose Contrastive Natural-language Programming-language Pre-training (CNPP), a novel approach that enables multimodal vulnerability patch search via contrastive learning. Extensive evaluations demonstrate that GitPatchDB paired with CNPP achieves 95.99% accuracy in vulnerability patch search, surpassing baseline methods by over 8% and establishing a new state of the art.
Primary Area: datasets and benchmarks
Submission Number: 13341
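
The abstract describes CNPP only as contrastive learning between natural-language vulnerability descriptions and patch code, without stating the loss. As a rough illustration of that general idea, the sketch below assumes a CLIP-style symmetric InfoNCE objective over paired (CVE description, patch) embeddings and ranks candidate patches by cosine similarity. The function names, the temperature value, and the use of precomputed embeddings are all hypothetical and are not taken from the submission.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, code_emb, temperature=0.07):
    """CLIP-style loss (an assumption, not the paper's stated objective).

    Row i of text_emb (a CVE description) is the positive for row i of
    code_emb (its patch); every other row in the batch is a negative.
    """
    t = l2_normalize(text_emb)
    c = l2_normalize(code_emb)
    logits = t @ c.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(t))
    loss_t2c = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> code
    loss_c2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # code -> text
    return 0.5 * (loss_t2c + loss_c2t)


def rank_patches(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar to the query."""
    q = l2_normalize(query_emb[None, :])
    c = l2_normalize(candidate_embs)
    scores = (c @ q.T).ravel()
    return np.argsort(-scores)
```

In such a setup, patch search reduces to embedding a CVE description once and calling `rank_patches` against the embeddings of candidate commits; how the embeddings are produced from diffs or program slices is left open here.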