Another virtue of wavelet forests

Published: 01 Jan 2024, Last Modified: 08 Jul 2025, Venue: DCC 2024, License: CC BY-SA 4.0
Abstract: The FM-index is one of the main success stories of the field of compact data structures and is a key part of many important tools in bioinformatics. Its primary weakness is a lack of access locality, with each step in a backward search typically causing several cache misses. If the indexed text is more than about lg σ times the size of the cache, where σ is the size of the alphabet, then the bitvector at each level of the wavelet tree over the Burrows-Wheeler Transform (BWT) of the text may by itself be larger than the cache, causing a cache miss as we descend from each level of the wavelet tree to the next. The resulting slowdown can be enough to cause practitioners to switch from FM-indexes to compressed suffix arrays, which have somewhat better locality.
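To make the access pattern described in the abstract concrete, here is a minimal Python sketch of backward search over a wavelet tree built on the BWT. It is an illustration under stated assumptions, not the paper's implementation: the names `WaveletTree`, `bwt`, `build_index`, and `count` are hypothetical, the BWT is built by naively sorting rotations, and bit rank is computed by summing a Python list rather than with a succinct bitvector. The point it shows is that every rank query touches one bitvector per level, which in a real index means up to about lg σ separate memory regions per query.

```python
class WaveletTree:
    """Pointer-based balanced wavelet tree with naive bit rank (illustrative only)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = None              # None marks a leaf (a single symbol)
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            left_syms = set(self.alphabet[:mid])
            # Bit 0 sends a symbol to the left child, bit 1 to the right child.
            self.bits = [0 if c in left_syms else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in left_syms],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in left_syms],
                                     self.alphabet[mid:])

    def rank(self, c, i):
        """Return (occurrences of c in seq[:i], number of bitvectors visited)."""
        if c not in self.alphabet:
            return 0, 0
        node, levels = self, 0
        while node.bits is not None:
            mid = len(node.alphabet) // 2
            ones = sum(node.bits[:i])  # naive rank1; a real index uses o(n)-bit rank support
            if c in node.alphabet[:mid]:
                i, node = i - ones, node.left
            else:
                i, node = ones, node.right
            levels += 1                # one more level, i.e. one more bitvector touched
        return i, levels


def bwt(text):
    """Naive BWT via sorted rotations; assumes '$' is absent from text and sorts first."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def build_index(text):
    """Build the wavelet tree over the BWT plus the C array (symbols smaller than c)."""
    last = bwt(text)
    C, total = {}, 0
    for c in sorted(set(last)):
        C[c] = total
        total += last.count(c)
    return WaveletTree(last), C, len(last)


def count(pattern, wt, C, n):
    """Backward search: return (number of occurrences, total bitvector levels visited)."""
    lo, hi, levels = 0, n, 0
    for c in reversed(pattern):
        if c not in C:
            return 0, levels
        r_lo, l1 = wt.rank(c, lo)
        r_hi, l2 = wt.rank(c, hi)
        lo, hi = C[c] + r_lo, C[c] + r_hi
        levels += l1 + l2
        if lo >= hi:
            return 0, levels
    return hi - lo, levels


wt, C, n = build_index("abracadabra")
occurrences, levels_visited = count("abra", wt, C, n)
```

In this toy version everything fits in cache, so `levels_visited` only counts how many bitvectors were touched. In an index over a large text, each of those visits lands in a different bitvector, which is the source of the cache misses the abstract describes.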