GenomeOcean: Efficient Foundation Model for Genome Generation

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Genome Foundation Model, Genome Generation
TL;DR: A billion-parameter genome foundation model for genome generation.
Abstract: We introduce GenomeOcean, a 4-billion-parameter genome foundation model that natively generates DNA sequences that adhere to the input context. With an efficiency-oriented model design, GenomeOcean is 80 times faster than existing models of similar size in genome generation. Unlike most existing genome foundation models—such as DNABERT and Nucleotide Transformer—that are designed for discriminative tasks, GenomeOcean leverages generative modeling to unlock new potentials in genomics research. Diverging from the traditional reliance on reference genomes—which possess inherent biases—GenomeOcean is exclusively trained on large-scale curated environmental samples collected from diverse ecosystems, including oceans, lakes, forests, and soils. This extensive genomic diversity, encompassing uncultured and uncharacterized organisms, allows GenomeOcean to generate sequences that better reflect the true diversity of life. In a series of automated evaluations, we demonstrate GenomeOcean's capability to understand and follow context sequences. Compared to existing models, GenomeOcean not only better retains species information but also produces sequences with more appropriate open reading frame lengths and codon usage bias. We anticipate that the open release of GenomeOcean will open up new possibilities in genomics and computational biology research.
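The abstract cites open reading frame (ORF) length as one criterion for judging generated sequences. As a rough illustration of what such a metric measures, the sketch below is a hypothetical, simplified ORF-length helper (forward strand only, `ATG` starts, standard stop codons); it is not the paper's actual evaluation code.

```python
def orf_lengths(seq: str) -> list[int]:
    """Return lengths (in nucleotides, stop codon included) of ORFs on the
    forward strand: each ORF runs from an ATG start codon to the first
    in-frame stop codon (TAA, TAG, or TGA).

    Simplified illustration: ignores the reverse strand, nested starts,
    and alternative genetic codes.
    """
    stops = {"TAA", "TAG", "TGA"}
    seq = seq.upper()
    lengths = []
    for frame in range(3):           # scan all three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i            # open a candidate ORF
            elif start is not None and codon in stops:
                lengths.append(i + 3 - start)  # close it at the stop codon
                start = None
    return lengths


# Example: one 9-nt ORF (ATG AAA TAA) in frame 0.
print(orf_lengths("ATGAAATAA"))  # -> [9]
```

Distributions of such lengths over generated sequences could then be compared against those of natural genomes, which is the spirit of the evaluation the abstract describes.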
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8373