CausalOmics-10T: An Evolving Foundational Dataset to Enable Causal Modeling of Microbial Ecosystems
Track: Full / long paper (5-8 pages)
Keywords: metagenomics, foundation models, data reclamation, sparsified genomics, quality-aware tokenization, multi-omics, causal inference, microbial ecosystems, reinforcement learning
TL;DR: We present a sparsification and quality-aware tokenization pipeline that reclaims public metagenomic data, train the first multi-omic microbial foundation model, and propose a 10-trillion-base-pair dataset for causal modeling of microbial ecosystems.
Abstract: Public microbiome archives hold over 100 petabases of sequencing data, yet we estimate that 95% remains unusable for foundation-model pre-training due to heterogeneous quality, noise, and missing causal structure. We present a two-stage data reclamation pipeline, **sparsification** followed by **quality-aware tokenization (QA-Token)**, that lifts the usable fraction of public archives from 5% to 40% (+35 percentage points, an $8\times$ increase in usable data). In the first stage, structured binary patterns systematically exclude uninformative bases; we evaluate 224 sparsification configurations on the CAMI benchmark and identify a compact Pareto frontier of 12–14 configurations achieving up to $5.1\times$ speedup at F1$=$0.994. In the second stage, a reinforcement-learning framework incorporates per-base Phred quality directly into vocabulary construction, producing hierarchically structured, semantically meaningful tokens. We validate the full pipeline by training **Quorum-7B**, a 7B-parameter multi-omic foundation model pre-trained on 1.3 trillion base pairs of metagenomics and 500K metabolite profiles. Quorum-7B outperforms METAGENE-1 and Evo2-7B on two benchmarks with competitive baselines (93.0$\to$93.5 MCC on pathogen detection; 0.89$\to$0.91 F1 on metagenomic profiling) and establishes first results on four multi-omic benchmarks, namely metabolic pathway prediction (wF1 0.85) and three clinical tasks, at $18\times$ faster inference. Building on these results, we propose **CausalOmics-10T**, a foundational dataset combining 10 trillion base pairs reclaimed via this pipeline with 100,000+ interventional trajectories generated through model-guided experimental design, targeting three high-impact AI tasks: forecasting, counterfactual prediction, and safe inverse design of microbial therapies.
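Illustrative sketch (not the authors' implementation): the abstract describes the first stage as structured binary patterns that exclude bases, but does not say how a pattern is applied to a read. The Python below assumes the simplest reading, in which a short pattern of `1`s (keep) and `0`s (drop) is tiled periodically along the sequence. The names `sparsify` and `kept_fraction` and the pattern `"110"` are hypothetical and chosen only for illustration.

```python
def sparsify(sequence: str, pattern: str) -> str:
    """Keep base i when the pattern, tiled along the read, has '1' at position i.

    Assumption: periodic tiling of a binary pattern; the actual pipeline may
    apply its patterns differently.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    period = len(pattern)
    return "".join(
        base for i, base in enumerate(sequence) if pattern[i % period] == "1"
    )


def kept_fraction(pattern: str) -> float:
    """Fraction of bases retained by a pattern (its density of 1s)."""
    return pattern.count("1") / len(pattern)


read = "ACGTACGTACGT"
sparse = sparsify(read, "110")  # drops every third base
print(sparse, kept_fraction("110"))
```

Under this assumption, a configuration is just a pattern string, so sweeping many configurations (as in the 224 evaluated on CAMI) amounts to enumerating patterns and measuring accuracy and speed for each.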
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 44