Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Benchmark, Biological Language Model, Protein Evolution, Gene Editing
TL;DR: We propose TadABench-1M, a large-scale wet-lab protein activity dataset, built with a novel pipeline that ensures label consistency across experimental rounds.
Abstract: Large language models trained on biomolecular sequences—DNA, RNA, and proteins—exhibit impressive in silico scaling trends, yet their practical utility in laboratory protein engineering remains under‑explored. We assemble a million‑example, wet‑lab‑validated dataset comprising 31 rounds of directed evolution on the tRNA‑specific adenosine deaminase (TadA) that underlies adenine base editors. To harmonize labels across rounds, we introduce Seq2Graph, a scalable graph‑based reconciliation algorithm that mitigates sequencing noise. Leveraging this resource, we propose TadABench-1M, an application‑oriented benchmark that tasks models with ranking candidate variants for the next evolutionary round, given data from all previous rounds. State‑of‑the‑art biological language models achieve a Spearman correlation of only $\rho \approx 0.1$ under this realistic setup, in sharp contrast with $\rho \approx 0.8$ on a random split of the same dataset, revealing a striking gap between computational metrics and wet‑lab success. Controlled ablations show that sequence diversity and round coverage, rather than raw data density, dominate performance, pinpointing key bottlenecks for next‑generation biological language models. TadABench-1M thus provides a large-scale, realistic foundation for developing and evaluating pre-trained biological language models.
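For concreteness, the next-round ranking evaluation described in the abstract can be sketched as below. The record field names, the `score_fn` interface, and the `evaluate_next_round` helper are illustrative assumptions for this sketch, not the released TadABench-1M API.

```python
# Minimal sketch of the round-based evaluation protocol: train on earlier
# rounds only, rank the candidates of the held-out round, and measure rank
# agreement with measured activity via Spearman correlation.
# Assumed (hypothetical) record schema: {"round": int, "sequence": str, "activity": float}.
import numpy as np
from scipy.stats import spearmanr

def evaluate_next_round(records, score_fn, eval_round):
    """Rank candidates of `eval_round` using only data from earlier rounds.

    score_fn(train_records, sequences) -> predicted scores, e.g. produced by a
    biological language model conditioned on the earlier-round measurements.
    """
    train = [r for r in records if r["round"] < eval_round]
    test = [r for r in records if r["round"] == eval_round]

    preds = np.asarray(score_fn(train, [r["sequence"] for r in test]))
    truth = np.array([r["activity"] for r in test])

    rho, _ = spearmanr(preds, truth)  # rank correlation with wet-lab activity
    return rho
```

Under this reading, the random-split figure quoted above would correspond to partitioning examples after pooling all 31 rounds, rather than holding out a whole future round.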
Submission Number: 46