Reconstructing Phylogenies Using Branch-Variable Substitution Models and Unaligned Biomolecular Sequences: A Performance Study and New Resampling Method

Published: 06 Sept 2023, Last Modified: 08 Apr 2024, OpenReview Archive Direct Upload, Everyone, CC BY-NC-SA 4.0
Abstract: In many clades in the Tree of Life, nucleotide substitution rates and base frequencies are hypothesized to have changed as genome evolution unfolded over time. Rigorous testing of this hypothesis relies on accurate phylogenetic reconstruction under suitable models of biomolecular sequence evolution. By far the most common approach for phylogenetic reconstruction is a “two-phase” analysis, where unaligned biomolecular sequence data are first aligned, and the resulting multiple sequence alignment (MSA) is used as input to downstream phylogenetic reconstruction. For a traditional “homogeneous” substitution model that is fixed across a species phylogeny, it has long been established that accurate phylogenetic inference and learning require accurate upstream multiple sequence alignments. But the same question has not been carefully studied for “heterogeneous” models of substitution processes that can vary across the branches of a phylogeny. We therefore conducted a comprehensive performance study to quantify the impact of upstream MSA estimation error on downstream phylogenetic inference and learning under branch-variable models of nucleotide substitution. Across model conditions with either 10 or 20 taxa and spanning a range of evolutionary divergence, we find a consistent and significantly positive association between upstream and downstream estimation error. The relationship is robust to the choice of MSA estimation method as well as substitution model mis-specification. We further quantify the relatively large contribution of upstream MSA estimation error to downstream phylogenetic reconstruction quality, compared to other experimental factors. We also conducted an empirical study of flowering monocots. Phylogenetic analyses of orthologous genes in the clade confirm the simulation study findings, and species tree estimation using branch-variable substitution models reveals new insights into sequence evolution heterogeneity.
Our findings underscore several key gaps in the state of the art, including the need for MSA-aware phylogenetic inference and learning methods under heterogeneous models of sequence evolution. To this end, we introduce a new computational method, NoHTS (“Non-Homogeneous Tree Support”), to directly assess phylogenetic estimation uncertainty due to MSA estimation error and other factors. The new method uses sequence-aware statistical resampling to place confidence intervals on a phylogeny estimated under a branch-variable substitution model. We demonstrate its superior type I and type II error rates compared with the phylogenetic bootstrap method, a de facto standard in phylogenetic and phylogenomic studies.
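For context on the baseline named above: the standard phylogenetic bootstrap resamples the columns (sites) of a single, fixed MSA with replacement and re-estimates a tree from each replicate. The Python sketch below shows only that column-resampling step on a toy alignment. It is not the NoHTS procedure, whose details the abstract does not give, and the names `bootstrap_replicate` and `toy_msa` are illustrative assumptions rather than anything from the paper.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Return one bootstrap replicate of an aligned set of sequences.

    alignment: dict mapping taxon name -> aligned sequence (all equal length).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices with replacement; the same indices are
    # applied to every taxon so that each sampled column stays intact.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(seq[i] for i in cols)
        for taxon, seq in alignment.items()
    }


# Toy alignment (hypothetical data, for illustration only).
toy_msa = {
    "taxon_A": "ACGT-ACG",
    "taxon_B": "ACGTTACG",
    "taxon_C": "AC-TTACA",
}

replicate = bootstrap_replicate(toy_msa, random.Random(42))
```

In a full analysis, a tree would be estimated from each replicate and support computed from how often each branch recurs. Because every replicate is drawn from the same fixed alignment, this resampling scheme conditions on the MSA as given.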