Darwin-7B: A Multi-Omic Foundation Model for the Human Gut Microbiome via Sparsified Quality-Aware Tokenization

Published: 03 Mar 2026, Last Modified: 03 Mar 2026 · ICLR 2026 Workshop FM4Science Poster · CC BY 4.0
Keywords: foundation models, metagenomics, microbiome, tokenization, reinforcement learning, sparsified genomics, multi-omics, metabolomics, pathogen detection, causal inference
TL;DR: We introduce Darwin-7B, a 7B multi-omic foundation model pretrained on 8T bp via sparsified quality-aware tokenization, outperforming METAGENE-1 and Evo2-7B on six benchmarks with 18× faster inference.
Abstract: Public sequence archives hold over 100 petabases of sequencing data, yet the vast majority remains unusable for foundation-model pretraining due to heterogeneous quality and missing causal structure. We present a two-stage data reclamation pipeline — **sparsification** followed by **quality-aware tokenization** (QA-Token) — that lifts the usable fraction from 5% to 40% (8× more data). In the first stage, we systematically exclude uninformative bases using structured binary patterns. We evaluate 224 sparsification configurations, identifying a Pareto frontier for species-level taxonomic classification on the CAMI benchmark that spans 5.1× speedup (species F1=0.51) to near-lossless accuracy (species F1=0.994, ~1.0× speedup). In the second stage, QA-Token incorporates per-base Phred quality directly into vocabulary construction via multi-objective reward-guided bilevel optimization with Gumbel–Softmax relaxation. We validate the full pipeline with **Darwin-7B**, a 7B-parameter multi-omic foundation model pretrained on 8 trillion base pairs of metagenomic sequence and 250K metabolite profiles. Darwin-7B outperforms METAGENE-1 and Evo2-7B on shared genomic benchmarks: 94.5 ± 0.4 Matthews correlation coefficient (MCC) on pathogen detection and 0.98 ± 0.01 F1 on metagenomic profiling. It also establishes first results on four multi-omic tasks not accessible to single-modality models: 0.91 ± 0.02 wF1 on metabolic pathway prediction, 0.947 ± 0.012 AUC on IBD, 0.883 ± 0.015 AUC on T2D, and 0.910 ± 0.013 AUC on antibiotic resistance prediction. Inference is 18× faster than Evo2-7B, of which ~15× derives from the Mamba–Transformer hybrid architecture and ~1.2× from QA-Token compression. We further describe a pilot implementing the first phase of **MetaOmics-10T**, combining 10 trillion reclaimed base pairs with 100,000+ interventional trajectories for causal modeling.
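To make the two-stage pipeline concrete, here is a minimal illustrative sketch, not the authors' implementation: stage 1 applies a structured binary keep/drop pattern to a read, and stage 2 tags k-mer tokens with a binned Phred quality so the vocabulary can distinguish high- from low-confidence sequence. The actual QA-Token vocabulary is learned via reward-guided bilevel optimization with a Gumbel–Softmax relaxation, which this sketch omits; all names here (`PATTERN`, `QUALITY_BINS`, `qa_tokenize`) are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage data reclamation idea described in the
# abstract. Not the paper's code; the learned vocabulary optimization is omitted.
import itertools

# Stage 1: structured binary sparsification.
# A repeating keep/drop mask; e.g. keep 3 bases, drop 1 (~1.33x compression).
PATTERN = (1, 1, 1, 0)  # assumed pattern for illustration

def sparsify(read: str) -> str:
    """Drop every base whose position in the repeating pattern is 0."""
    keep = itertools.cycle(PATTERN)
    return "".join(base for base, k in zip(read, keep) if k)

# Stage 2: quality-aware tokens.
# Bin per-base Phred scores and attach the bin label to each k-mer token.
QUALITY_BINS = [(0, 20, "lo"), (20, 30, "mid"), (30, 100, "hi")]  # assumed bins

def phred_bin(q: int) -> str:
    for lo, hi, name in QUALITY_BINS:
        if lo <= q < hi:
            return name
    return "hi"

def qa_tokenize(read: str, phred: list[int], k: int = 4) -> list[str]:
    """Emit non-overlapping k-mer tokens tagged with each window's mean-quality bin."""
    tokens = []
    for i in range(0, len(read) - k + 1, k):
        window = read[i : i + k]
        mean_q = sum(phred[i : i + k]) / k
        tokens.append(f"{window}|{phred_bin(int(mean_q))}")
    return tokens

read = "ACGTACGTACGT"
quals = [35, 34, 12, 33, 36, 11, 30, 28, 40, 39, 38, 37]
print(sparsify(read))             # every 4th base dropped -> "ACGACGACG"
print(qa_tokenize(read, quals))   # ['ACGT|mid', 'ACGT|mid', 'ACGT|hi']
```

In this toy version the quality bin is a fixed rule; the paper instead optimizes how quality interacts with vocabulary construction end-to-end, which is what allows the reported ~1.2× compression gain without hand-tuned thresholds.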
Submission Number: 110