Keywords: large language models, microbiome foundation models, multi-modality sequencing data, self-supervised pretraining, causal language modelling, scaling, metagenomics, benchmark
TL;DR: We pretrain large language models on a 539K-sample microbiome corpus spanning four DNA sequencing modalities, investigate model and dataset scaling, and demonstrate that the resulting models outperform state-of-the-art foundation models on our eight-task benchmark.
Abstract: We explore the application of large language models (LLMs) to microbiome data, a domain that remains underexplored despite the rise of self-supervised learning in biology. We introduce Atlas, a large-scale pretraining dataset comprising over 539,000 samples from MGnify, spanning multiple DNA sequencing modalities including amplicon, assembly, and whole-metagenome data. Using Atlas, we train Waypoint, a family of GPT-style causal language models for microbiome data. To enable standardized evaluation, we present Compass, a benchmark of eight downstream microbiome prediction tasks. We show that our pretrained Waypoint models outperform classical methods and prior foundation models, with gains driven by both dataset scale and representation choices. Our results establish pretrained LLMs as a strong and practical approach for microbiome prediction tasks.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23