Abstract: Models that map DNA and protein sequences into deep embeddings have been recently developed. While their ability to improve prediction in downstream tasks has been demonstrated, clear advantages and disadvantages of embedding types, and different means of applying them, are not yet available. In this paper we compare five different models (one for DNA, four for proteins) and different embedding aggregation methods with respect to their ability to preserve evolutionary and functional information, using a hierarchical tree approach. Specifically, we introduce a novel procedure that builds hierarchical clustering trees to assess the relative position of sequences in the embedding latent space, compared to the phylogenetic and functional similarities between sequences. The methods are benchmarked on five different datasets from various organisms. The ESM protein language model and DNABert emerge as best performers in different settings.
Loading