Alignment-Free Estimation of Read to Genome Distances and Its Applications

Published: 01 Jan 2025, Last Modified: 15 May 2025. Venue: RECOMB 2025. License: CC BY-SA 4.0
Abstract: Searching a reference database for reads of unknown origin and finding evolutionarily similar genomes is central to many applications. Quantifying this similarity by estimating the distance between each read and its matching references can further help downstream analyses, such as taxonomic characterization or placing the read on a reference phylogeny. Such distances can be computed using alignment. However, because alignment becomes impractical for ultra-large reference databases, many analyses rely on \(k\)-mer-based tools for downstream tasks, and these methods do not compute distances. We introduce krepp, which computes the distance between a read and a genome using a maximum likelihood framework built around \(k\)-mers. We further introduce algorithms to place reads on a reference tree. Empirical evaluations show that krepp accurately estimates distances, scales well with large databases, and can place short reads coming from any part of the genome on a reference phylogeny.
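To make the idea of a \(k\)-mer-based read-to-genome distance concrete, here is a minimal sketch. It is not krepp's algorithm, which the abstract describes only as a maximum likelihood framework. It assumes a much simpler textbook model in which every base mutates independently with probability \(d\), so an exact \(k\)-mer match survives with probability \((1-d)^k\); inverting the observed match fraction \(p\) gives \(d \approx 1 - p^{1/k}\). The function names `kmer_set` and `estimate_distance` are hypothetical, and the sketch ignores reverse complements, repeats, and sequencing errors.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def estimate_distance(read, genome, k):
    """Estimate a per-base distance between a read and a genome.

    Assumption (not from the paper): each base mutates independently
    with probability d, so a k-mer matches exactly with probability
    (1 - d)**k. Inverting the observed match fraction p gives
    d = 1 - p**(1/k).
    """
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        raise ValueError("read is shorter than k")
    genome_kmers = kmer_set(genome, k)
    # Count read k-mers that occur exactly somewhere in the genome.
    matched = sum(1 for km in read_kmers if km in genome_kmers)
    p = matched / len(read_kmers)
    return 1.0 - p ** (1.0 / k)


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCTAA"
    read = "GTTGCAAGCTTC"  # genome[2:14] with its last base changed
    print(estimate_distance(read, genome, k=4))
```

With exact matching alone, a read that shares no \(k\)-mer with a genome gets the uninformative estimate of 1, which is one reason a richer likelihood model over \(k\)-mers is attractive for more divergent references.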