Keywords: biological sequences, variational autoencoders, latent representations, Ornstein-Uhlenbeck process, evolution
Abstract: We introduce a deep generative model for representation learning of biological sequences that, unlike existing models, explicitly represents the evolutionary process. The model uses a tree-structured Ornstein-Uhlenbeck process, obtained from a given phylogenetic tree, as an informative prior for a variational autoencoder. We show that the model performs well on the task of ancestral sequence reconstruction for single protein families. Our results and ablation studies indicate that explicitly representing evolution with a suitable tree-structured prior has the potential to improve representation learning of biological sequences considerably. Finally, we briefly discuss extensions of the model to genome-scale data sets and to the case of a latent phylogenetic tree.
One-sentence Summary: Ancestral protein sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder
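As an illustration of what a tree-structured Ornstein-Uhlenbeck prior implies, the sketch below computes the prior covariance between nodes of a small phylogenetic tree. It is a minimal example under stated assumptions, not the model's actual implementation: it assumes a one-dimensional, zero-mean latent variable and a root drawn from the stationary distribution, in which case the covariance between two nodes is sigma^2 / (2 * alpha) * exp(-alpha * d), with d the path length between them along the tree. The toy tree, the node names, and the parameters `alpha` and `sigma` are hypothetical.

```python
import math

# Hypothetical toy tree: child -> (parent, branch_length). "root" has no parent.
TREE = {
    "anc": ("root", 0.5),
    "leaf_a": ("anc", 0.2),
    "leaf_b": ("anc", 0.3),
    "leaf_c": ("root", 1.0),
}


def path_to_root(node, tree):
    """Map each ancestor of `node` (and `node` itself) to its distance from `node`."""
    dist, out = 0.0, {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        dist += length
        out[parent] = dist
        node = parent
    return out


def tree_distance(a, b, tree):
    """Path length between a and b via their most recent common ancestor."""
    up_a, up_b = path_to_root(a, tree), path_to_root(b, tree)
    # With positive branch lengths, the closest shared ancestor minimizes the sum.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def ou_covariance(nodes, tree, alpha=1.0, sigma=1.0):
    """Covariance of a stationary OU process at `nodes` (assumed parametrization)."""
    stat_var = sigma ** 2 / (2.0 * alpha)  # stationary variance
    return [
        [stat_var * math.exp(-alpha * tree_distance(a, b, tree)) for b in nodes]
        for a in nodes
    ]


leaves = ["leaf_a", "leaf_b", "leaf_c"]
cov = ou_covariance(leaves, TREE)
```

Under these assumptions, closely related leaves such as `leaf_a` and `leaf_b` receive a larger prior covariance than distantly related ones such as `leaf_a` and `leaf_c`, which is how the phylogeny can inform the latent space.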