Keywords: Genomics, Hyperbolic Metric Learning, Hierarchical Embeddings
Abstract: The bacterial kingdom remains largely unexplored, with new strains continuously being discovered. The exponentially growing size of bacterial databases creates a need for succinct yet informative representations of these vast microbial collections, allowing for fast and efficient genome classification and comparison. To address this, we propose Hyperbiome, a metric learning framework that takes advantage of the geometry of the Poincaré ball to reconstruct the bacterial taxonomy and compute a latent space where distances reflect biological similarities between genomes. By incorporating the taxonomic hierarchy in hyperbolic space, we learn representative proxies at both the species and genus level, which guide the embedding of each bacterial assembly. Finally, using the species-level proxies, we build a compact index that enables rapid classification of new assemblies while avoiding exhaustive query-vs-all scans of the database. Experiments on AllTheBacteria, the largest bacterial database, demonstrate that Hyperbiome effectively captures biological relationships. Moreover, we show that our proxy-based index achieves high accuracy, substantially reduces the computational cost of querying, and generalizes successfully to previously unseen species, supporting continuous updates without retraining the metric model.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 17282
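The abstract describes classifying a new assembly by comparing its embedding against species-level proxies in the Poincaré ball rather than against every genome in the database. The sketch below illustrates that general idea only, using the standard Poincaré-ball geodesic distance and a plain linear scan over proxies. It is not the authors' implementation: the function names, the 2-D proxy coordinates, and the species labels are hypothetical, and the abstract does not specify how the actual index is organized.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def nearest_proxy(query, proxies):
    """Return the label of the proxy with the smallest hyperbolic distance to the query."""
    return min(proxies, key=lambda label: poincare_distance(query, proxies[label]))


# Hypothetical 2-D species proxies (assumed values, for illustration only).
species_proxies = {
    "species_A": (0.80, 0.10),
    "species_B": (0.75, 0.25),
    "species_C": (-0.60, 0.50),
}

# Hypothetical embedding of a newly sequenced assembly.
query_embedding = (0.78, 0.12)
predicted = nearest_proxy(query_embedding, species_proxies)
print(predicted)
```

In this sketch the query cost grows with the number of proxies instead of the number of stored genomes, and a previously unseen species could be supported by adding one more entry to the proxy table, with no change to the distance function.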