OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning
Keywords: multimodal large language models, omics foundation models, transcriptomics, single-cell RNA-seq, bulk RNA-seq, instruction tuning, biological question answering, gene expression, cross-modal alignment, perturbation prediction, cell type annotation, GEO, multi-sample reasoning, conversational biology, continuous token embeddings, Qwen3, Gene Expression Omnibus, biological benchmarks, zero-shot transfer
TL;DR: OmicsLM is a multimodal LLM that injects transcriptomic profiles into its context as continuous tokens, enabling multi-sample biological reasoning that matches specialized omics models on profile-level tasks and outperforms general LLMs on omics QA.
Abstract: Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most models either consume expression profiles without producing natural-language explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks by representing each transcriptomic profile as a compact continuous embedding in the LLM context. This preserves quantitative expression signal while supporting natural-language instructions, explicit gene mentions, and multiple interleaved samples.
We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, diverse language templates, and free-text biological knowledge. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering.
Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.
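The continuous-token input described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how profiles are encoded, so the random linear projection below is only a stand-in assumption for a learned omics encoder, and all names (`make_projection`, `embed_profile`, `build_context`, the toy vocabulary and dimensions) are hypothetical. The sketch shows one idea only: each expression profile becomes a single vector of the same width as a text-token embedding, so several samples can be interleaved with instruction text in one sequence.

```python
import random

def make_projection(n_genes, d_model, seed=0):
    """Random n_genes x d_model matrix; an assumed stand-in for a learned encoder."""
    rng = random.Random(seed)
    scale = 1.0 / n_genes ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(n_genes)]

def embed_profile(expression, projection):
    """Map one expression vector (length n_genes) to one d_model-sized continuous token."""
    d_model = len(projection[0])
    out = [0.0] * d_model
    for value, row in zip(expression, projection):
        for j in range(d_model):
            out[j] += value * row[j]
    return out

def build_context(segments, token_table, projection):
    """Interleave text-token embeddings and profile embeddings in their given order."""
    context = []
    for kind, payload in segments:
        if kind == "text":
            context.extend(token_table[tok] for tok in payload)
        elif kind == "profile":
            context.append(embed_profile(payload, projection))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return context

# Toy setup: 3 genes, 4-dimensional embeddings, a 4-word vocabulary (all illustrative).
D_MODEL = 4
N_GENES = 3
rng = random.Random(1)
vocab = ["compare", "sample", "and", "?"]
token_table = {tok: [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for tok in vocab}
projection = make_projection(N_GENES, D_MODEL)

# Two samples interleaved with instruction text.
segments = [
    ("text", ["compare", "sample"]),
    ("profile", [5.0, 0.0, 1.2]),
    ("text", ["and", "sample"]),
    ("profile", [0.3, 7.1, 0.0]),
    ("text", ["?"]),
]
context = build_context(segments, token_table, projection)
print(len(context))  # 7 positions: 5 text tokens + 2 profile tokens
```

In this sketch each profile occupies exactly one position in the sequence regardless of the number of genes, which is one way to read "compact continuous embedding"; the actual encoder, embedding width, and number of positions per profile used by OmicsLM are not stated here.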
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59