RustBuildEq: A Benchmark for Binary Equivalence Under Build Variability

Elliott Wen; Chenye Ni; Valerio Terragni; Jens Dietrich

Published: 14 May 2026, Last Modified: 14 May 2026. Venue: AIWare 2026 Benchmark and Dataset. License: CC BY 4.0.

Keywords: Binary Equivalence, Software Supply Chain Security, Reproducible Builds, Build Variability, Rust

TL;DR: We introduce RustBuildEq, a large-scale benchmark for evaluating binary equivalence in Rust under realistic build variability, capturing both equivalent and non-equivalent artifacts across diverse toolchains and configurations.

Abstract: Reproducible independent rebuilds strengthen software supply-chain integrity by recreating the original build environment and enforcing bitwise equivalence between artifacts. However, this approach implicitly assumes a trustworthy toolchain and can fail under adversarial manipulation of the build process itself (e.g., the Ken Thompson attack). Prior work has explored introducing diversity across build environments to reduce reliance on any single toolchain, and has proposed AI-driven methods to establish behavioural equivalence while tolerating benign build variability in the Java ecosystem. In this work, we extend this line of research to Rust and present RustBuildEq, a benchmark for training and evaluating binary equivalence classifier models under realistic build variability. We curate a large corpus of crates drawn from the top 20% of the crates.io ecosystem and construct datasets of equivalent (EQ) and non-equivalent (NEQ) pairs with rich provenance metadata. EQ pairs are generated from identical source revisions under varying toolchain versions and build configurations, while NEQ pairs are derived from AST rewrites or API-breaking changes across versions. Many Rust crates rely heavily on generics and cannot be compiled into binaries without specifying concrete types; to address this, we develop an automated approach that combines heuristic type instantiation, witness-type synthesis, and an iterative AI repair loop. RustBuildEq comprises 19,184,671 EQ records and 273,848,531 NEQ records, and includes a Python API for dataset navigation. The dataset provides large-scale ground truth for training and evaluating AI-driven models for reasoning about binary equivalence and is publicly available at https://doi.org/10.5281/zenodo.19244908.
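
The EQ construction described in the abstract (same source revision, different toolchain version or build configuration) can be illustrated with a minimal Python sketch. This is not the RustBuildEq Python API: the `BuildRecord` class, its field names, and the `eq_pairs` function are hypothetical and exist only to make the pairing rule concrete. NEQ pairs are not covered here, since they depend on AST rewrites or known API-breaking changes rather than on a simple grouping rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class BuildRecord:
    # Hypothetical fields for illustration; not the actual RustBuildEq schema.
    crate: str       # crate name
    version: str     # source revision the artifact was built from
    toolchain: str   # compiler version used for the build
    profile: str     # build configuration, e.g. "debug" or "release"
    artifact: str    # identifier of the resulting binary


def eq_pairs(records):
    """Pair artifacts built from the same source under different build settings."""
    # Group records by the exact source they were built from.
    by_source = defaultdict(list)
    for record in records:
        by_source[(record.crate, record.version)].append(record)

    pairs = []
    for group in by_source.values():
        for a, b in combinations(group, 2):
            # Keep only pairs whose toolchain or configuration differs.
            if (a.toolchain, a.profile) != (b.toolchain, b.profile):
                pairs.append((a.artifact, b.artifact))
    return pairs
```

Under these assumptions, two builds of the same crate revision with different compiler versions form one EQ pair, while a build of a later revision is left unpaired, because a source change alone does not establish either label.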

Submission Number: 10