Abstract: Counting the occurrence frequency of each k-mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most k-mer counting algorithms rely on a given k to produce single-length k-mers, which is inefficient for sequence analysis for different k. Moreover, existing k-mer counters focus more on DNA and RNA sequences and less on protein ones. In practice, the analysis of k-mers in protein sequences can provide substantial biological insights in structure, function, and evolution. To this end, an efficient algorithm, called MulMer (Multiple-Mer mining), is proposed to mine k-mers of various lengths termed multi-mers via inverted-index technique, which is orders of magnitude faster than the conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first able to mine multi-mers in a variety of sequences, including DNA, RNA, and protein sequences.
0 Replies
Loading