COMPOTE: Generating a Dataset of Real-World Binary Level Vulnerabilities

ICLR 2026 Conference Submission16626 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI, LLM, Vulnerability Detection, Binary-Level, Cybersecurity, Code Generation, Dataset Creation, LLVM-IR, Compilation
TL;DR: We present Compote, an AI tool that converts real-world vulnerable C code into compilable binaries, enabling CompRealVul, a large-scale, realistic dataset for training and evaluating binary-level vulnerability detection models.
Abstract: Once a proprietary program written in a compiled language like C is successfully compiled, it is typically distributed as a binary executable. Consequently, security analysis of the program, including vulnerability detection, relies solely on the binary. Binary-level detection methods have been developed over the years, with machine learning (ML)-based methods becoming increasingly popular in the last decade. However, the scarcity of high-quality, publicly available datasets limits the development of ML-based binary vulnerability detectors: existing binary-level vulnerability datasets are often synthetic and fail to reflect real-world vulnerabilities. At the same time, existing real-world source-code vulnerability datasets cannot be directly compiled, as they typically consist of standalone function snippets rather than compilable programs. To address this limitation, we present Compote, a COMPilation AI-Orchestrated Transformation Engine that automatically wraps standalone C functions with the minimal scaffolding, such as headers, mocks, and main(), needed for successful compilation, without altering the original code. Applying Compote to real-world functions from ten public datasets of vulnerable code yields a dataset comprising 18K compilable C functions along with their compiled binary versions. Our dataset represents a novel, large-scale, realistic, labeled benchmark spanning both source and binary domains. To evaluate our dataset, we fine-tune state-of-the-art vulnerability detection models. We show that models trained and tested exclusively on existing (synthetic) datasets achieve up to 98.97% F1 but drop to 29.28% when tested on the real-world vulnerabilities in our dataset. This demonstrates that models trained on synthetic datasets fail to generalize to real-world binary vulnerabilities, resulting in a significant drop in detection performance.
We release Compote and our datasets to the research community to support further research on building and evaluating effective and practical binary vulnerability detection models.
Primary Area: datasets and benchmarks
Submission Number: 16626