Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 2: Dataset Proposal Competition
Keywords: Causal Foundation Models, Foundational Datasets, Microbial Digital Twin, Systems Biology, Causal Inference, Interventional Trajectories, Model-Guided Experimental Design, Quality-Aware Tokenization, Multi-omics
TL;DR: We propose MetaOmics-10T, a foundational dataset for building causal digital twins of microbiomes. It enables forecasting and design, moving biology from observation to engineering.
Abstract: We propose **MetaOmics-10T**, an openly shareable, foundational dataset to unlock AI-accelerated discovery in microbial ecosystems. The dataset directly enables three high-impact AI tasks: (1) forecasting ecosystem dynamics, (2) predicting counterfactual outcomes of interventions, and (3) inverse design of microbial therapies under safety constraints. MetaOmics-10T combines two components: **10 trillion base pairs** reclaimed from public archives using a Quality-Aware Tokenization (QA-Token) framework, and **100,000+ interventional trajectories** generated via model-guided experimental design. The result is a first-of-its-kind, probabilistic, intervention-ready corpus that addresses the principal bottleneck for causal modeling in microbiome science and provides an empirical testbed to assess the reach and limits of causal inference at scale.
Submission Number: 357