Hyperbolic geometry-based deep learning methods to produce population trees from genotype data

Aman Patel

13 May 2023 | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: Producing population-level trees from the genomic data of individuals is a fundamental task in population genetics. Typically, these trees are built with parsimony, distance-based, maximum-likelihood, or Bayesian methods. However, such methods do not integrate easily into larger workflows, and they represent the data only as a discrete tree structure rather than in a continuous space. In this study, we address these problems by introducing deep learning methods for building trees from genotype data. Our models create continuous representations of population trees in hyperbolic space, which has previously proven highly effective for embedding hierarchically structured data. We present two architectures, a multi-layer perceptron (MLP) and a variational autoencoder (VAE), and analyze their performance using a variety of metrics along with comparisons to established tree-building methods. Both models, when tested on human sequences, produce embedding spaces that reflect human evolutionary history. In addition, we demonstrate the generalizability of these models by verifying that new samples are added to an existing tree in a semantically meaningful manner. Next, we use Dasgupta’s cost to compare the quality of trees generated by our models to those produced by established methods, including nearest point, furthest point, UPGMA, and WPGMA. Although the benchmark methods are fit directly on the evaluation data, our models outperform some of them and achieve highly comparable performance overall. Finally, we verify the performance of our models on simulated data with known population structure, and use these data to suggest extensions and further improvements.
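To make the tree-quality comparison concrete, the following is a minimal sketch of Dasgupta's cost as it is usually defined: for a tree T over samples with pairwise similarities w_ij, the cost is the sum over all pairs (i, j) of w_ij times the number of leaves under the lowest common ancestor of i and j, with lower values indicating a better tree. The abstract does not give the exact formulation or similarity measure used in the study, so the nested-tuple tree encoding, the function names `leaves` and `dasgupta_cost`, and the toy similarity matrix below are all illustrative assumptions rather than the author's implementation.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a tree given as nested tuples (hypothetical encoding)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, sim):
    """Sum over leaf pairs of sim[i][j] * (number of leaves under the pair's LCA)."""
    if not isinstance(tree, tuple):
        return 0.0  # a single leaf contains no pairs
    child_leaves = [leaves(child) for child in tree]
    n_here = sum(len(ls) for ls in child_leaves)
    cost = 0.0
    # Pairs that fall in different children have this node as their LCA.
    for left, right in combinations(child_leaves, 2):
        for i in left:
            for j in right:
                cost += sim[i][j] * n_here
    # Pairs inside the same child are charged further down the tree.
    return cost + sum(dasgupta_cost(child, sim) for child in tree)


# Toy similarities (assumed, not from the study): samples 0 and 1 are close,
# samples 2 and 3 are close, and the two groups are far apart.
sim = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.8, 0.0],
]

good_tree = ((0, 1), (2, 3))  # groups similar samples first
bad_tree = ((0, 2), (1, 3))   # splits similar samples at the root

good_cost = dasgupta_cost(good_tree, sim)  # approximately 5.0
bad_cost = dasgupta_cost(bad_tree, sim)    # approximately 8.0
print(good_cost, bad_cost)
```

The tree that merges similar samples low in the hierarchy receives the lower cost, which is the property that lets the metric rank trees from different methods on the same similarity matrix.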
