Hyperbolic Variational Autoencoders for Phylogenetic Latent Spaces: Geometric Priors for Evolutionary Sequence Modeling
Keywords: Hyperbolic geometry, variational autoencoders, phylogenetics, Poincaré ball, evolutionary sequence modeling, Pfam, GISAID, wrapped normal, ELBO
TL;DR: PhyloVAE is a VAE with a hyperbolic (Poincaré ball) latent space for phylogenetic sequence data; it attains $O(\delta)$ distortion of phylogenetic distances versus $\Omega(n^{1/d})$ for Euclidean VAEs, and 15--22\% better distance preservation in experiments.
Abstract: Biological sequences evolve along phylogenetic trees, yet standard VAEs embed them in flat Euclidean spaces that distort tree-like hierarchical structure. We introduce PhyloVAE, a variational autoencoder with hyperbolic latent geometry that naturally encodes evolutionary relationships. Using the Poincaré ball model of hyperbolic space $\mathbb{H}^d$, we derive a closed-form hyperbolic reparameterization trick for the wrapped normal distribution and prove that the ELBO decomposes into a reconstruction term plus a hyperbolic KL divergence admitting an analytic expression. Our main theoretical result shows that PhyloVAE's latent-space distortion of phylogenetic distances is $O(\delta)$, where $\delta$ is the tree's hyperbolicity constant, compared to $\Omega(n^{1/d})$ for Euclidean VAEs on $n$-taxon trees, an exponential improvement. We further prove that the posterior concentrates around the maximum likelihood phylogeny at rate $O(n^{-1/2})$ in Wasserstein distance on the Phylogenetic Orangespace. On protein family clustering (Pfam), viral evolution tracking (GISAID SARS-CoV-2), and RNA secondary structure prediction, PhyloVAE achieves a 15--22\% improvement in phylogenetic distance preservation while maintaining competitive reconstruction accuracy (BLEU $\geq 0.94$ for sequences). Our framework opens new directions for geometry-aware generative modeling in computational biology.
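To make the wrapped normal reparameterization on the Poincaré ball concrete, the following is a minimal sketch and not the authors' implementation. It assumes unit negative curvature, a diagonal covariance, and the common convention in which Gaussian noise drawn in the tangent space at the origin is parallel-transported to the mean and pushed onto the ball by the exponential map; under that convention the sample reduces to $z = \mu \oplus \exp_0(v)$. The function names (`mobius_add`, `exp_map_origin`, `poincare_distance`, `sample_wrapped_normal`) are hypothetical and chosen for illustration only.

```python
import math
import random


def mobius_add(x, y):
    """Mobius addition x (+) y on the unit Poincare ball (curvature -1)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    return [((1.0 + 2.0 * xy + y2) * a + (1.0 - x2) * b) / denom
            for a, b in zip(x, y)]


def exp_map_origin(v):
    """Exponential map at the origin: tanh(||v||) * v / ||v||."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(norm) / norm
    return [scale * a for a in v]


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - x2) * (1.0 - y2)))


def sample_wrapped_normal(mu, sigma, rng):
    """Reparameterized sample: z is a deterministic function of (mu, sigma, eps).

    Assumed convention: v = sigma * eps in the tangent space at the origin,
    then z = mu (+) exp_0(v). Returns both z and the tangent noise v.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    v = [s * e for s, e in zip(sigma, eps)]
    z = mobius_add(mu, exp_map_origin(v))
    return z, v


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [0.3, -0.2]  # illustrative mean inside the unit ball
    z, v = sample_wrapped_normal(mu, [0.5, 0.5], rng)
    print("sample:", z)
    print("hyperbolic distance to mean:", poincare_distance(mu, z))
```

Because all randomness enters through `eps`, gradients with respect to `mu` and `sigma` can flow through the sample in an autodiff framework, which is the point of a reparameterization trick. Under the convention assumed here, the hyperbolic distance from `mu` to the sample equals twice the Euclidean norm of `v`; other papers scale the tangent vector differently, which changes that factor.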
Submission Number: 152