Track: Track 2: Dataset Proposal Competition
Keywords: Biomolecular Foundation Models, Genome Foundation Models, Protein Foundation Models, Cross-Modal Evaluation, Benchmarking
TL;DR: We propose a cross-modal benchmark to support the head-to-head evaluation of genome and protein foundation models.
Abstract: In recent years, biomolecular foundation models (bioFMs) have been trained on massive amounts of omics data to encode complex patterns in biological sequences. These models have shown remarkable predictive performance across a broad range of applications in biotechnology, as well as the ability to generate novel, viable sequences. However, the vast majority of existing bioFMs are unimodal, trained exclusively on nucleotide or amino acid sequences, and are evaluated on tasks specific to their sequence type. With few benchmarks incorporating multi-omics data, opportunities for cross-modal evaluations are limited. To address this gap, we propose a novel cross-modal benchmark that links nucleotide and amino acid sequences to common biological outcomes. Given a pair of genes, our benchmark will pose questions such as: _Do the encoded proteins co-localize or share similar functions? Are the genes associated with a common disease or linked to common drug targets?_ By providing a common platform for evaluation, our benchmark will support comparisons of unimodal and multimodal bioFMs, offering a foundation for tracking their capabilities and informing appropriate safety oversight.
Submission Number: 389