Keywords: Multi-omics Large Language Models
TL;DR: We benchmark MOLLMs and introduce MOE, a design that aligns multi-omics encoders with an LLM; it achieves superior performance and narrows the gap to specialist models across nine tasks.
Abstract: Understanding living systems requires interpretable models to elucidate how multi-omics data coordinate transcription and translation across spatiotemporal scales.
Inspired by large language models (LLMs), biological foundation models pretrained on omics sequences have shown exciting performance. However, these biological models lack interpretability and transparency in explaining their results. Motivated by advances in cross-modal alignment from vision–language models (VLMs), it is natural to integrate multi-omics data and natural language into one system: the multi-omics large language model (MOLLM), an LLM-based model that can understand multi-omics data. To understand the trends, challenges, and limitations of MOLLMs, we provide a comprehensive empirical study. We systematically review recent progress on MOLLMs based on their omics-encoding design and benchmark the performance gap between MOLLMs and omics-specific models. Extensive experiments show that the proposed multi-omics-encoding design outperforms existing MOLLMs by a large margin and shows promise for narrowing the performance gap against specialist biological models. Code is available at \href{https://anonymous.4open.science/r/BioMLLM_V2-B5E2}{https://anonymous.4open.science/r/mollm}.
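To make the alignment idea concrete, below is a minimal sketch of the common VLM-style recipe the abstract alludes to: embeddings from a (typically frozen) pretrained omics encoder are passed through a small projector into the LLM's token-embedding space and concatenated with text embeddings. This is an illustrative assumption, not the paper's actual architecture; all names (`OmicsProjector`), dimensions, and the two-layer MLP projector are hypothetical.

```python
import torch
import torch.nn as nn

class OmicsProjector(nn.Module):
    """Hypothetical projector mapping omics-encoder embeddings
    into an LLM's token-embedding space (VLM-style alignment)."""
    def __init__(self, omics_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(omics_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, omics_embeddings: torch.Tensor) -> torch.Tensor:
        # omics_embeddings: (batch, seq_len, omics_dim) from a pretrained omics encoder
        return self.proj(omics_embeddings)  # (batch, seq_len, llm_dim)

# Illustrative usage: prepend projected omics "tokens" to text token
# embeddings before feeding the combined sequence to the LLM backbone.
batch, n_omics, n_text = 2, 16, 32
omics_feats = torch.randn(batch, n_omics, 512)   # placeholder encoder output
text_embeds = torch.randn(batch, n_text, 4096)   # placeholder LLM embeddings
projector = OmicsProjector(omics_dim=512, llm_dim=4096)
llm_inputs = torch.cat([projector(omics_feats), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 48, 4096])
```

A multi-omics-encoding design of this kind would use one such encoder–projector pair per omics modality, with the LLM attending over the concatenated sequence.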
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 934