ON THE IMPACT OF EMBEDDING ANISOTROPY IN GENOMIC LANGUAGE MODELS FOR BACTERIAL TAXONOMY

Published: 02 Mar 2026, Last Modified: 10 Mar 2026, Gen² 2026 Poster, Readers: Everyone, License: CC BY 4.0
Track: Full / long paper (5-8 pages)
Keywords: anisotropy, 16S rRNA, gLM
TL;DR: 16S rRNA sequence embeddings generated by DNABERT-2 are highly anisotropic; whitening successfully transforms them into an isotropic state, boosting classification accuracy.
Abstract: Genomic language models have emerged as powerful tools for representing DNA sequences, yet the impact of intrinsic properties of pre-trained embeddings, such as anisotropy, on downstream genomic tasks remains underexplored. In this work, we examine the geometric structure of DNABERT-2 embeddings derived from full-length 16S rRNA gene sequences and analyze how anisotropy affects bacterial taxonomic classification. We compare raw embeddings with post-processed representations obtained through a simple whitening transformation and evaluate their performance using distance-based classification across multiple taxonomic ranks. Our results show that DNABERT-2 embeddings exhibit severe anisotropy and that whitening substantially improves isotropy and consistently enhances classification performance, particularly at finer-grained taxonomic levels. These findings highlight the importance of embedding geometry when deploying genomic language models for downstream biological analysis.
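A minimal sketch of what a whitening step of this kind could look like, assuming PCA whitening and using mean pairwise cosine similarity as the anisotropy measure. Both choices are assumptions made for illustration; the abstract does not specify the exact whitening variant or isotropy metric. The function names `mean_cosine_similarity` and `whiten` are hypothetical, and the data is a synthetic stand-in rather than real DNABERT-2 embeddings.

```python
import numpy as np

def mean_cosine_similarity(X):
    """Average cosine similarity over all distinct pairs of rows.

    Values near 1 suggest the vectors share a dominant direction
    (strong anisotropy); values near 0 suggest a more isotropic spread.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    # Exclude the diagonal (self-similarity is always 1).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

def whiten(X, eps=1e-8):
    """PCA whitening: center, then rescale so the covariance is ~identity."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Divide each eigenvector (column) by the square root of its eigenvalue.
    W = eigvecs / np.sqrt(eigvals + eps)
    return Xc @ W

# Synthetic stand-in for embeddings (assumption): a large shared offset
# plus noise with unequal variance per dimension.
rng = np.random.default_rng(0)
X = 5.0 + rng.normal(size=(500, 16)) * np.linspace(0.1, 2.0, 16)
Z = whiten(X)

print("mean cosine similarity, raw:     ", mean_cosine_similarity(X))
print("mean cosine similarity, whitened:", mean_cosine_similarity(Z))
```

On this toy data the raw vectors are dominated by the shared offset, so their pairwise cosine similarity is high, while the whitened vectors are centered with approximately identity covariance. Distance-based classification (for example, nearest neighbour) would then operate on `Z` instead of `X`.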
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 23