Unsupervised Whole-Genome Representation Learning Captures Bacterial Phenotypes

Published: 06 Mar 2025, Last Modified: 18 Apr 2025
Venue: ICLR 2025 Workshop LMRL
License: CC BY 4.0
Track: Full Paper Track
Keywords: representation learning, bacteria, genome, genotype-to-phenotype, language model, NLP
TL;DR: Through self-supervised learning on hundreds of thousands of whole bacterial genomes we learn representations of genomes which are predictive of many phenotypes.
Abstract: Shifting from hand-crafted to learned representations of data has revolutionized fields like natural language processing and computer vision. Despite this, current approaches to bacterial phenotype prediction from the genome rely on training machine learning models on hand-crafted features, often binary indicators or counts of the presence of different conserved genomic elements and protein domains. Defining these shared elements and domains as our “genomic element vocabulary”, we tokenize entire bacterial genomes as sequences of these conserved elements and take advantage of advances in long-context language modeling to perform self-supervised whole-genome representation learning (WGRL). Through multi-task pretraining on a phylogenetically diverse dataset of hundreds of thousands of bacterial genomes, we present a genomic language model which produces representations of input genomes with features predictive of a broad range of phenotypes. We assess the quality of the learned representations through k-nearest neighbours prediction of 25 bacterial phenotypes, finding our WGRL representations more predictive than standard protein domain presence/absence representations for 23/25 different phenotypes. We additionally find the WGRL representations are robust to both poor genome assembly quality and incompleteness. Through learning the relationships between evolutionarily conserved genomic elements with self-supervised long-context language modeling, we demonstrate the first approach for extracting general-purpose whole-genome representations while preserving gene order.
Attendance: Cameron Dufault
Submission Number: 85
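
The abstract contrasts two views of a genome: an ordered sequence of conserved-element tokens (the input to the language model) and an order-free presence/absence vector (the baseline), both evaluated by k-nearest neighbours. The following is a minimal sketch of that contrast, not the authors' implementation. The element IDs, phenotype labels, Hamming distance, and value of k are all assumptions chosen for illustration, and the learned WGRL representations themselves are not modelled here.

```python
from collections import Counter

def build_vocab(genomes):
    """Assign an integer token to each distinct element ID, in first-seen order."""
    vocab = {}
    for genome in genomes:
        for element in genome:
            if element not in vocab:
                vocab[element] = len(vocab)
    return vocab

def tokenize(genome, vocab):
    """Order-preserving token sequence, as a sequence model would consume it."""
    return [vocab[element] for element in genome]

def presence_absence(genome, vocab):
    """Order-free binary vector: 1 if the element occurs anywhere in the genome."""
    vector = [0] * len(vocab)
    for element in set(genome):
        if element in vocab:  # elements outside the vocabulary are ignored
            vector[vocab[element]] = 1
    return vector

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, vectors, labels, k=3):
    """Majority vote over the labels of the k nearest reference vectors."""
    ranked = sorted(range(len(vectors)), key=lambda i: hamming(query, vectors[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: element IDs and phenotype labels are made up.
genomes = [
    ["elem_A", "elem_B", "elem_C"],
    ["elem_C", "elem_B", "elem_A"],  # same elements as the first, reversed order
    ["elem_D", "elem_E"],
]
labels = ["aerobe", "aerobe", "anaerobe"]

vocab = build_vocab(genomes)
reference = [presence_absence(g, vocab) for g in genomes]
query = presence_absence(["elem_A", "elem_B"], vocab)
predicted = knn_predict(query, reference, labels, k=2)
```

In this toy data the first two genomes tokenize to different sequences but map to the same presence/absence vector, which is the gene-order information the baseline discards and a sequence model can, in principle, use.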