Abstract: Finding repetitive nucleic acid elements is a crucial step in many sequence analysis tasks, including the challenging task of sequence assembly, the linkage of repeats to genetic disorders, and the identification of gene transfer. The most widely used tool for finding repeats de novo is REPuter [2]. REPuter enumerates all maximal k-mismatch repeats by extending maximal repeated pairs. Unfortunately, the number of these pairs can be quadratic in n, the length of the input sequence, so its successor Vmatch applies greedy heuristics to speed up the extension process. In this talk, we will introduce the concept of supermaximal k-mismatch repeats, whose number is linear in n and which capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We will present SMART, a C++ tool based on recent algorithmic advances that computes supermaximal k-mismatch repeats directly. We will also show that the elements SMART outputs are statistically much more significant than those output by state-of-the-art tools. The full paper describing SMART appeared as [1].
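To make the notion of a maximal k-mismatch repeat concrete, the following is a minimal brute-force sketch in Python. It is not SMART's algorithm and it does not compute supermaximal repeats. It assumes one common reading of the definition, which the abstract itself does not spell out: a repeat is a pair of equal-length substrings at distinct start positions whose Hamming distance is at most k, and it is maximal if extending both copies by one character to the left or to the right would either run off the sequence or exceed k mismatches. The function name `maximal_k_mismatch_repeats` and the `min_len` filter are hypothetical, introduced only for illustration.

```python
def maximal_k_mismatch_repeats(s, k, min_len):
    """Enumerate maximal k-mismatch repeats by brute force (quadratic time).

    Returns (start1, start2, length) triples with start1 < start2.
    The two copies are allowed to overlap.
    """
    n = len(s)
    repeats = []
    # d is the distance between the two copies: s[i] is compared with s[i + d].
    for d in range(1, n):
        aligned = n - d  # number of comparable positions at this distance
        mismatches = [i for i in range(aligned) if s[i] != s[i + d]]
        # Sentinels stand for the positions just outside the sequence.
        bounds = [-1] + mismatches + [aligned]
        if len(mismatches) <= k:
            # The whole alignment fits within the mismatch budget.
            windows = [(0, aligned - 1)]
        else:
            # Each window spans exactly k mismatches and is delimited on
            # both sides by a further mismatch or by the sequence boundary.
            windows = [
                (bounds[p] + 1, bounds[p + k + 1] - 1)
                for p in range(len(mismatches) - k + 1)
            ]
        for first, last in windows:
            length = last - first + 1
            if length >= min_len:
                repeats.append((first, first + d, length))
    return repeats


print(maximal_k_mismatch_repeats("ACGTACGA", k=0, min_len=3))
print(maximal_k_mismatch_repeats("ACGTACGA", k=1, min_len=4))
```

On the toy sequence `ACGTACGA`, the exact repeat `ACG` (positions 0 and 4) grows into the 1-mismatch repeat `ACGT` / `ACGA` once k = 1 is allowed. Because this sketch compares every pair of positions, it only serves to illustrate the definition on small inputs.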