TadABench-1M: A Large-Scale Wet-Lab Protein Benchmark For Rigorous OOD Evaluation

Published: 24 Sept 2025 · Last Modified: 26 Dec 2025
NeurIPS 2025 AI4Science Spotlight · CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Benchmark, Biological Language Model, Protein Evolution, Gene Editing
TL;DR: We propose TadABench-1M, a large-scale wet-lab protein activity dataset built with a novel pipeline that ensures measurement consistency across experimental rounds.
Abstract: Existing benchmarks for biological language models (BLMs) inadequately capture the challenges of real-world applications, often lacking realistic out-of-distribution (OOD) scenarios, evolutionary depth, and measurement consistency. To address this, we introduce TadABench-1M, a new benchmark based on a wet-lab dataset of over one million variants of the therapeutically relevant TadA enzyme, purpose-built to embody these three essential attributes. Generated across 31 rounds of wet-lab evolution, it offers unparalleled evolutionary depth and naturally presents a stringent OOD challenge. To ensure measurement consistency across this extensive campaign, we developed Seq2Graph, a scalable graph-based algorithm that systematically unifies multi-batch experimental data. Our high-fidelity benchmark highlights a critical finding: while state-of-the-art BLMs excel on a standard random split of the data (Spearman's $\rho \approx 0.8$), they fail dramatically on a realistic temporal prediction task ($\rho \approx 0.1$). This stark performance gap validates our benchmark's design principles and suggests that evolutionary depth is critical for building models with genuine real-world utility.
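The abstract does not describe how Seq2Graph operates beyond calling it a graph-based algorithm for unifying multi-batch data. As a purely illustrative sketch of that general idea, one might link batches through variants measured in more than one batch and solve a small least-squares problem for per-batch offsets; everything below (the additive offset model, the data layout, and the `estimate_batch_offsets` helper) is our assumption, not the paper's method.

```python
# Hypothetical sketch of graph-based multi-batch calibration (NOT the
# paper's Seq2Graph; an additive-offset model we assume for illustration).
import numpy as np

def estimate_batch_offsets(measurements):
    """measurements: list of (variant_id, batch_id, activity) tuples.

    Solves a least-squares system for one additive offset per batch so
    that repeated measurements of the same variant agree across batches.
    The first batch is pinned to offset 0 to fix the gauge.
    """
    variants = sorted({v for v, _, _ in measurements})
    batches = sorted({b for _, b, _ in measurements})
    v_idx = {v: i for i, v in enumerate(variants)}
    b_idx = {b: j for j, b in enumerate(batches)}

    # Unknowns: per-variant "true" activity, then per-batch offset.
    n_v, n_b = len(variants), len(batches)
    rows, rhs = [], []
    for v, b, y in measurements:
        row = np.zeros(n_v + n_b)
        row[v_idx[v]] = 1.0          # true activity of variant v
        row[n_v + b_idx[b]] = 1.0    # offset of batch b
        rows.append(row)
        rhs.append(y)
    # Gauge constraint: the first batch's offset is 0.
    gauge = np.zeros(n_v + n_b)
    gauge[n_v] = 1.0
    rows.append(gauge)
    rhs.append(0.0)

    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return {b: sol[n_v + j] for b, j in b_idx.items()}

# Toy usage: batch "r2" reads ~0.5 higher than "r1" on shared variants.
data = [("A23T", "r1", 1.0), ("A23T", "r2", 1.5),
        ("G45S", "r1", 0.2), ("G45S", "r2", 0.7),
        ("K67R", "r2", 1.1)]
print(estimate_batch_offsets(data))  # r1 ~ 0.0, r2 ~ 0.5
```

The gauge constraint pins one batch to zero; without it the offsets are only identifiable up to a shared constant. Shared variants act as the graph's edges, so batches with no overlap cannot be calibrated against each other under this toy model.

The random-versus-temporal comparison behind the reported $\rho \approx 0.8$ vs. $\rho \approx 0.1$ gap can likewise be sketched; the column names, the toy regression model, and the synthetic drift below are our assumptions, not TadABench-1M's released data or API.

```python
# Hedged sketch of the random-vs-temporal split comparison; column names
# ("round", "activity") and the model are assumptions, not TadABench's API.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def evaluate(df, train_mask, fit, predict):
    """Fit on masked rows, return Spearman's rho on the held-out rows."""
    train, test = df[train_mask], df[~train_mask]
    model = fit(train)
    rho, _ = spearmanr(predict(model, test), test["activity"])
    return rho

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "round": rng.integers(1, 32, n),        # 31 evolution rounds
    "feature": rng.normal(size=n),          # stand-in for sequence features
})
# Toy activity whose relationship to the feature drifts across rounds,
# mimicking distribution shift over the evolutionary campaign.
df["activity"] = (df["feature"] * (1 - df["round"] / 31)
                  + rng.normal(scale=0.1, size=n))

fit = lambda train: np.polyfit(train["feature"], train["activity"], 1)
predict = lambda coef, test: np.polyval(coef, test["feature"])

# Random split: train/test drawn i.i.d. from all rounds.
random_mask = rng.random(n) < 0.8
# Temporal split: train on early rounds, predict later ones.
temporal_mask = df["round"] <= 24

print("random split rho:  ", evaluate(df, random_mask, fit, predict))
print("temporal split rho:", evaluate(df, temporal_mask, fit, predict))
```

On real data the temporal split trains on early evolution rounds and tests on later ones, which is exactly the extrapolation regime where the abstract reports performance collapsing to $\rho \approx 0.1$.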
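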
Submission Number: 46