Keywords: metagenomic foundation model, genomic language model, DNA, LLM, metagenomics
TL;DR: We train the first foundation model on metagenomic data curated from diverse wastewater samples, scale the model to 7 billion parameters, and describe all of our pretraining, fine-tuning, and evaluation procedures.
Abstract: We pretrain a 7-billion-parameter autoregressive transformer language model, which we refer to as a *metagenomic foundation model (MGFM)*, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced with deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, MGFM aims to capture the full distribution of genomic information present in this wastewater. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored to metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable effective modeling of metagenomic data. We then report results from pretraining this model on our metagenomic dataset, including losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the model's capabilities through empirical results on an initial set of genomic benchmark and out-of-distribution detection tasks, showcasing its potential for a range of metagenomic applications.
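The abstract mentions BPE tokenization adapted to metagenomic sequences without specifying an implementation. As an illustration only, the sketch below shows one plausible way to train such a tokenizer on raw nucleotide strings with the Hugging Face `tokenizers` library; the input file name, vocabulary size, and special tokens are assumptions, not details taken from the paper.

```python
# Illustrative sketch only (not the authors' pipeline): training a byte-pair
# encoding (BPE) tokenizer directly on raw DNA/RNA strings, assuming one
# sequence per line in a plain-text file. File name, vocabulary size, and
# special tokens are hypothetical.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Nucleotide strings contain no whitespace; a byte-level pre-tokenizer simply
# keeps the setup robust to any stray characters in the corpus.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=1024,  # hypothetical size; the paper does not state this value
    special_tokens=["[UNK]", "[BOS]", "[EOS]", "[PAD]"],
)
tokenizer.train(files=["metagenomic_sequences.txt"], trainer=trainer)

# Encode a short made-up read: BPE merges frequent multi-nucleotide motifs
# into single tokens, shortening sequences relative to per-base tokenization.
encoded = tokenizer.encode("ACGTACGTTAGCCGATCGATCGGATTACA")
print(encoded.tokens)
```

One motivation for BPE over fixed k-mer or per-base tokenization is that learned merges compress common motifs into single tokens, which increases the effective context length of the model in base pairs.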
Submission Number: 73