Exploring Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

Younhun Kim; Georg K. Gerber; Travis E Gibson

Published: 28 May 2026, Last Modified: 03 Jun 2026
Venue: ICML 2026 FM4LS Workshop Poster
License: CC BY 4.0

Keywords: Large Language Models, Microbiome, Genomic Language Models, Feature Extraction

TL;DR: We study a deep-learning architecture that uses pre-trained LLM embeddings of bacterial genomes to produce pooled representations of microbiomes.

Abstract: Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and the differences between GLM embedding choices.
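To make the idea of set aggregation concrete, the following is a minimal sketch, not the authors' architecture. It assumes each community member already has a fixed-length genome embedding (random stand-ins here, in place of real GLM outputs), applies a hypothetical per-genome transform, mean-pools the results into one community-level vector, and scores each member against that vector to produce relative abundances. The names `predict_abundances`, `w_phi`, and `w_out`, the tanh nonlinearity, and the softmax output are all illustrative assumptions.

```python
import math
import random

def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def mean_pool(vectors):
    """Average a set of equal-length vectors; the result ignores their order."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Map real-valued scores to positive values that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_abundances(genome_embeddings, w_phi, w_out):
    # Per-genome latent features (hypothetical transform, not the paper's).
    latents = [[math.tanh(v) for v in linear(w_phi, e)] for e in genome_embeddings]
    # Community-level representation: order-invariant pooling over members.
    community = mean_pool(latents)
    # Score each member from its own latent plus the shared community vector.
    scores = [sum(w * x for w, x in zip(w_out, z + community)) for z in latents]
    return softmax(scores)

rng = random.Random(0)
embed_dim, latent_dim, n_genomes = 8, 4, 3
# Random stand-ins for pre-trained GLM embeddings and learned weights.
embeddings = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(n_genomes)]
w_phi = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(latent_dim)]
w_out = [rng.gauss(0, 1) for _ in range(2 * latent_dim)]

abundances = predict_abundances(embeddings, w_phi, w_out)
```

Because the pooled vector is a mean over members, reordering the input genomes only reorders the predicted abundances, and the same function accepts communities of any size. In a trained model the weights would be learned from measured abundance profiles rather than sampled at random.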

Submission Number: 61