GeneBench: Systematic Evaluation of Genomic Foundation Models and Beyond

Zicheng Liu; Jiahui Li; Lei Xin; Siyuan Li; Chang Yu; Zelin Zang; Cheng Tan; Yufei Huang; yajingbai; Jun Xia; Stan Z. Li

Published: 13 Oct 2024, Last Modified: 01 Dec 2024. Venue: AIDrugX Poster. License: CC BY 4.0

Keywords: genetic foundation model, benchmark, hybrid model

TL;DR: We introduce GeneBench, a benchmark suite for evaluating Genomic Foundation Models (GFMs). Based on our experimental insights, we propose GenHybrid, an effective SSM-attention hybrid model suitable for all tasks.

Abstract: The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, enabling their use across a spectrum of downstream applications. Despite these advances, the lack of a standardized evaluation framework makes equitable assessment difficult because of differences in experimental settings, model intricacy, and benchmark datasets, as well as reproducibility challenges. Without standardization, comparative analyses risk becoming biased and unreliable. To address this, we introduce GeneBench, a comprehensive benchmarking suite tailored for evaluating the efficacy of Genomic Foundation Models. GeneBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with particular emphasis on both short-range and long-range genomic tasks, initially including the three most important categories of DNA tasks: Coding Region, Non-Coding Region, and Genome Structure. Our results on GeneBench reveal an interesting finding: regardless of parameter count, attention-based and convolution-based models show noticeably different preferences for short- and long-range tasks, which could offer valuable insights for the future development of GFMs. Based on this finding, we propose a straightforwardly modified model called GenHybrid, an effective and efficient convolution-attention hybrid model suitable for all tasks.

Submission Number: 139
