TadABench-1M: A Large-Scale Wet-Lab Protein Benchmark For Rigorous OOD Evaluation

ICLR 2026 Conference Submission 3776 Authors

10 Sept 2025 (modified: 26 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Benchmark, AI for Science, Biological Language Model, Protein Engineering, Gene Editing, Protein Fitness Dataset
TL;DR: This paper introduces TadABench-1M, a benchmark for evaluating out-of-distribution (OOD) generalization in Biological Language Models (BLMs) using protein engineering data.
Abstract: Existing benchmarks for biological language models (BLMs) inadequately capture the challenges of real-world applications, often lacking realistic out-of-distribution (OOD) scenarios, evolutionary depth, and consistency in measurement. To address this, we introduce TadABench-1M, a new benchmark based on a wet-lab dataset of over one million variants of the therapeutically relevant TadA enzyme, purpose-built to embody these three essential attributes. Generated across 31 rounds of wet-lab evolution, it offers unparalleled evolutionary depth and naturally presents a stringent OOD challenge. To ensure measurement consistency across this extensive campaign, we developed Seq2Graph, a scalable graph-based algorithm that systematically unifies multi-batch experimental data. Our high-fidelity benchmark highlights a critical finding: while state-of-the-art BLMs excel on a standard random split of the data (Spearman’s ρ ≈ 0.8), they fail dramatically on a realistic temporal prediction task (ρ ≈ 0.1). This stark performance gap validates the importance of our benchmark’s design principles and suggests that evolutionary depth is critical for building models with realistic utility.
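The random-versus-temporal evaluation gap described above can be illustrated with a minimal sketch. This is not the authors' released code: the column names (`round`, `fitness`, `sequence`), the 80/20 split ratio, and the `predict_fitness` model wrapper are hypothetical placeholders used only to show how the two splits and Spearman's ρ would be computed.

```python
# Sketch of the two evaluation protocols from the abstract: a standard random
# split versus a temporal split across evolution rounds, both scored with
# Spearman's rho. Assumes a DataFrame with hypothetical columns
# "sequence", "fitness", and "round", plus a user-supplied predict_fitness().
import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split


def evaluate_splits(df: pd.DataFrame, predict_fitness) -> dict:
    scores = {}

    # Random split: train/test variants drawn i.i.d. from the pooled campaign.
    train, test = train_test_split(df, test_size=0.2, random_state=0)
    preds = predict_fitness(train, test["sequence"])
    rho, _ = spearmanr(preds, test["fitness"])
    scores["random"] = rho

    # Temporal split: train on early evolution rounds, test on later rounds,
    # mimicking the OOD setting the benchmark is built around.
    cutoff = df["round"].quantile(0.8)
    train = df[df["round"] <= cutoff]
    test = df[df["round"] > cutoff]
    preds = predict_fitness(train, test["sequence"])
    rho, _ = spearmanr(preds, test["fitness"])
    scores["temporal"] = rho

    return scores
```

Under the paper's reported results, `scores["random"]` would land near 0.8 for state-of-the-art BLMs while `scores["temporal"]` would drop to roughly 0.1, which is the gap the benchmark is designed to expose.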
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 3776