Abstract: Using a sequence’s $$k$$-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Since $$k$$-mer sets often reach hundreds of millions of elements, traditional data structures are impractical for $$k$$-mer set storage, and Bloom filters and their variants are used instead. Bloom filters reduce the memory footprint required to store millions of $$k$$-mers while allowing for fast set containment queries, at the cost of a low false positive rate. We show that, because $$k$$-mers are derived from sequencing reads, the information about $$k$$-mer overlap in the original sequence can be used to reduce the false positive rate up to $$30\times$$ with little or no additional memory and with set containment queries that are only 1.3–1.6 times slower. Alternatively, we can leverage $$k$$-mer overlap information to store $$k$$-mer sets in about half the space while maintaining the original false positive rate. We consider several variants of such $$k$$-mer Bloom filters (kBF), derive theoretical upper bounds for their false positive rate, and discuss their range of applications and limitations. We provide a reference implementation of kBF at https://github.com/Kingsford-Group/kbf/ .
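To make the overlap idea concrete, the following is a minimal Python sketch, not the reference implementation linked above. It assumes a simple one-sided check: a $$k$$-mer that the Bloom filter reports as present is accepted only if at least one of its eight possible left or right neighbors is also reported as present. The names `BloomFilter`, `kmers`, `neighbors`, and `one_sided_query`, the SHA-256 double-hashing scheme, and the filter parameters are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (illustrative, not optimized)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit
        # halves of a single SHA-256 digest (an assumed hashing scheme).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer):
    """The eight k-mers that could precede or follow kmer in a sequence."""
    for base in "ACGT":
        yield base + kmer[:-1]  # candidate left neighbor
        yield kmer[1:] + base   # candidate right neighbor


def one_sided_query(bf, kmer):
    """Accept kmer only if it and at least one neighbor are in the filter."""
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))


# Usage: index the k-mers of one hypothetical read, then query.
read = "ACGTTGCATGCCGATAGC"
k = 5
bf = BloomFilter(num_bits=1024, num_hashes=3)
for km in kmers(read, k):
    bf.add(km)

true_kmer_hits = [one_sided_query(bf, km) for km in kmers(read, k)]
```

Because every $$k$$-mer taken from a read longer than $$k$$ has at least one true neighbor in the same read, this check introduces no false negatives for such input, while a false positive must now coincide with a second filter hit among its neighbors. The price is up to eight extra lookups per positive query, which is in line with the modest slowdown the abstract describes.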