Scaling Laws and Architectural Frontiers in Metagenomic Foundation Models

Published: 02 Mar 2026, Last Modified: 10 Mar 2026
Venue: Gen² 2026 Poster
License: CC BY 4.0
Track: Full / long paper (5-8 pages)
Keywords: Genomic Language Models, Biology, Transformers, Metagenomics, Scaling Laws, Bioinformatics
TL;DR: A blueprint for training metagenomic language models
Abstract: Foundation models for genomics have the potential to revolutionize therapeutic design, yet the optimal architectural choices for modeling the vast and diverse distribution of metagenomic data remain under-explored. In this work, we present the machine learning methodology behind EDEN, a family of metagenomic foundation models scaled up to 28 billion parameters and trained on 9.7 trillion nucleotide tokens. We provide a systematic empirical study of architectural trade-offs between autoregressive Transformers (Llama-style), state-space models (Mamba), and long-convolution architectures (Hyena) for nucleotide-level modeling. Contrary to recent trends favoring linear-time sequence models for long-range biological data, we demonstrate that the Llama architecture exhibits superior scaling efficiency and semantic retrieval capabilities as model capacity grows. We derive a set of quality-aware scaling laws for metagenomics, showing that model performance follows predictable power-law behavior across three orders of magnitude in parameters and data. Through extensive benchmarking spanning unsupervised zero-shot fitness prediction, semantic completion, and gene recovery, we establish a blueprint for scaling biological foundation models and provide empirical evidence for why Transformer-based architectures define the current frontier.
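To make the scaling-law claim concrete, the sketch below shows one common way such power-law fits are performed. This is an illustrative assumption only: the abstract does not specify the paper's functional form, and the data points, initial guesses, and fitted coefficients here are hypothetical placeholders, not results from EDEN. It assumes a Chinchilla-style parametric form L(N, D) = E + A/N^alpha + B/D^beta fit with `scipy.optimize.curve_fit`.

```python
# Hypothetical illustration of fitting a power-law scaling law to
# (parameters, tokens, loss) measurements. Not the paper's actual form or data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, B, alpha, beta):
    """Predicted loss given model parameters N and training tokens D."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Placeholder observations spanning roughly three orders of magnitude.
N = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 2.8e10])        # model parameters
D = np.array([1e11, 3e11, 1e12, 3e12, 6e12, 9.7e12])    # nucleotide tokens
loss = np.array([1.30, 1.21, 1.13, 1.07, 1.02, 0.99])   # made-up losses

# Fit the five coefficients; p0 is a rough initial guess.
popt, _ = curve_fit(
    scaling_law, (N, D), loss,
    p0=[0.8, 50.0, 500.0, 0.3, 0.3], maxfev=20000,
)
E, A, B, alpha, beta = popt
print(f"L(N, D) ~ {E:.3f} + {A:.1f}/N^{alpha:.3f} + {B:.1f}/D^{beta:.3f}")
```

Once fitted, such a curve can be extrapolated to predict the loss of a larger model before training it, which is the practical value of a scaling-law study.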
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 8