Keywords: Biological Reasoning, DNA Foundation Models, Large Language Models (LLMs), Reinforcement Learning, Multimodal
TL;DR: BioReason introduces a novel DNA-LLM architecture where the LLM directly processes genomic information, achieving superior, interpretable multi-step biological reasoning and accelerating mechanistic discovery.
Abstract: Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at [https://github.com/bowang-lab/BioReason](https://github.com/bowang-lab/BioReason).
Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 18371