Keywords: Benchmark, Dataset, Biological Language Model, Ranking, Base Editing
TL;DR: We introduce NovoBench-100K, a large-scale dataset for the in silico evolution of TadA, an enzyme crucial for adenosine-to-inosine base editing in gene correction.
Abstract: We introduce NOVOBENCH-100K, a large-scale protein dataset for the in silico evolution of TadA, an enzyme critical for base editing. The dataset originates from sequencing data collected over two rounds of our in vitro TadA evolution and comprises 101,687 unique DNA variants carrying an average of 11.1 amino acid mutations each. Rather than class or scalar-score labels, the dataset consists of 77,900 ranking lists, each containing 2, 10, or 100 sequences ranked by base editing efficiency. These rankings are generated by SEQ2RANK, our novel algorithm that accounts for the credibility of the underlying biological experiments and for ranking consistency. For evaluation, we provide two train-test splits, in-domain ranking and out-of-domain ranking, based on a standard 7:3 random split and on the actual in vitro evolution rounds, respectively. We benchmark 80 biological language models (BLMs) from 24 papers, spanning the protein, DNA, RNA, and multimodal domains. Comprehensive experiments show that BLMs perform well on in-domain ranking, and we provide a detailed analysis by modality, model size, and k-mer. On out-of-domain ranking, however, BLMs perform close to random guessing under both linear probing and fine-tuning, underscoring the need for highly generalizable models that can handle domain shifts between experimental rounds. Our wet-lab experiments are ongoing, and in the coming months we expect to add further rounds of in vitro evolution and a broader variety of proteins to the benchmark. We will release the code, the dataset, and the embeddings from all 80 evaluated BLMs.
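The abstract does not specify the ranking metric, so the following minimal Python sketch shows one plausible way to score a model's predictions against a single ground-truth ranking list, using Spearman correlation as an illustrative choice. All names here (evaluate_ranking_list, score_fn, the toy sequences) are hypothetical and do not reflect the actual NovoBench-100K API.

from scipy.stats import spearmanr

def evaluate_ranking_list(ranked_sequences, score_fn):
    # ranked_sequences: ground-truth list, ordered from most to least
    # efficient base editing.
    # score_fn: maps a sequence to a predicted efficiency score, e.g. a
    # linear probe on frozen BLM embeddings.
    gold_ranks = list(range(len(ranked_sequences)))  # 0 = most efficient
    predictions = [score_fn(seq) for seq in ranked_sequences]
    # A higher predicted score should correspond to a better (smaller) gold
    # rank, so negate predictions before correlating with ascending ranks.
    rho, _ = spearmanr(gold_ranks, [-p for p in predictions])
    return rho  # 1.0 = perfect agreement, ~0 = random, -1.0 = reversed

# Toy usage with placeholder sequences and a dummy scorer that happens to
# agree with the gold order.
toy_list = ["SEQ_A", "SEQ_B", "SEQ_C"]
toy_scores = {"SEQ_A": 0.9, "SEQ_B": 0.5, "SEQ_C": 0.1}
print(evaluate_ranking_list(toy_list, toy_scores.get))  # prints 1.0

Averaging such a per-list statistic over all 77,900 ranking lists would yield one benchmark-level number; the paper itself may use a different listwise metric.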
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2841