Does your model understand genes? A benchmark of gene properties for biological and text models

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Benchmark, Data Sets or Data Repositories, Computational Biology and Bioinformatics
TL;DR: We propose a unified benchmark for comparing gene representations produced by models trained on many data modalities, and find that performance is largely determined by the training data modality.
Abstract: The application of deep learning to biology, including foundation models, has increased significantly in recent years. Some models are text-based, while others are trained on the underlying biological data, especially omics data of various modalities. Consistently comparing the performance of deep learning models for biology has proven challenging due to the diversity of training data and downstream tasks. Here, we exploit the fact that many models operate at the level of genes and propose a unifying benchmark by defining hundreds of tasks based on ground-truth gene properties collected from professionally curated bioinformatics databases. We collect properties of five types: (1) genomic properties, including predicting which genes can be methylated or which are dose-dependent; (2) regulatory functions, evaluating how genes participate in cellular regulatory processes; (3) localization, including identification of differential expression in different tissues or sub-cellular localization; (4) biological processes, including predicting gene involvement in pathways or disease prognostics; and (5) protein properties, including prediction of functional domains or post-translational modifications. These properties are used to define binary, multi-label, and multi-class classification tasks. To create an architecture-agnostic benchmark, we extract gene representation vectors from each model, including single-cell RNA-seq (scRNA) foundation models, large language models, protein language models, DNA foundation models, and classical baselines, and use them to train simple predictive models on the tasks. Depending on the model, we either use the model's token-level embeddings of gene symbols or transform the gene symbol into an input appropriate for the model, i.e., a description of the gene for text models, the gene sequence for DNA models, or the amino acid sequence for protein models. Using these embeddings on the benchmark tasks, we create a detailed assessment of the relative performance of the different models. In general, we find that text-based models and protein language models outperform the expression-based models on tasks related to genomic properties and regulatory functions, while expression-based models tend to outperform the others on localization tasks. We also observe that the classical bag-of-words baseline performs comparably to the large language models on many tasks. By enabling broad, systematic evaluation of diverse deep learning models in biology, this benchmark can help direct future research in artificial intelligence toward improved biological understanding and accelerated therapeutic discoveries. The code and benchmark data are available on GitHub and can be extended to more models and tasks.
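As a concrete illustration of the probing protocol the abstract describes, the sketch below shows how one of the benchmark's binary tasks could be evaluated from frozen gene embeddings. This is a minimal sketch under stated assumptions, not the authors' released code: the file names (`gene_embeddings.npy`, `task_labels.npy`) are hypothetical placeholders, and logistic regression stands in for whatever "simple predictive model" the benchmark trains.

```python
# Minimal sketch of the probing protocol: train a simple predictive model
# on frozen gene representation vectors for one binary gene-property task.
# File names below are hypothetical placeholders, not the benchmark's actual data layout.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Precomputed gene embeddings, one row per gene symbol, extracted from an
# upstream model (scRNA foundation model, LLM, protein LM, DNA model, ...).
embeddings = np.load("gene_embeddings.npy")  # shape: (n_genes, d)
labels = np.load("task_labels.npy")          # shape: (n_genes,), binary property

# A simple probe: if the property is decodable from the frozen embedding,
# the upstream model has captured it in a usable form.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, embeddings, labels, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Keeping the probe simple is what makes the comparison architecture-agnostic: any task signal it recovers must already be present in the upstream model's representation rather than learned by the probe itself.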
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6953