Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
Abstract: Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing the models' versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at https://anonymous.4open.science/r/Biology-Instructions-FD66/.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Instruction benchmark, Multi-omics, AI for Biology
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, biology
Submission Number: 6512