Abstract: Motifs are relatively short sequences that are biologically significant, and their discovery in molecular sequences is a well-researched subject. A don’t care is a special letter that matches every letter in the alphabet. Formally, a motif is a sequence of letters of the alphabet and don’t care letters. A motif \(\tilde{m}_{d,k}\) that occurs at least k times in a sequence is maximal if it can neither be extended (to the left or right) nor specialised (that is, none of its \(d' \le d\) don’t cares can be replaced with a letter from the alphabet) without reducing its number of occurrences. Here we present a new dynamic data structure, and the first on-line algorithm, to discover all maximal motifs in a sliding window of length \(\ell \) on a sequence x of length n in \(\mathcal {O}(nd\ell + d\lceil \frac{\ell }{w}\rceil \cdot \sum _{i = \ell }^{n-1} |{\textsc {diff}}_{i-1}^{i}|)\) time, where w is the size of the machine word and \({\textsc {diff}}_{i-1}^{i}\) is the symmetric difference of the sets of occurrences of maximal motifs at \(x[i-\ell \mathinner {.\,.}i-1]\) and at \(x[i-\ell +1 \mathinner {.\,.}i]\).
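To make the definitions concrete, the following is a minimal Python sketch of how a motif with don’t cares matches a sequence, and of how the occurrence set of one fixed motif changes when the window slides by one position. It is a naive illustration only, not the data structure or on-line algorithm presented here: the `.` symbol for the don’t care letter and the function names `occurrences`, `window_occurrences` and `diff` are assumptions made for this example, and the sketch tracks a single given motif rather than all maximal motifs.

```python
DONT_CARE = "."  # assumed symbol for the don't care letter


def occurrences(motif, text):
    """Return the set of start positions at which motif matches text.

    A don't care in the motif matches any letter of the text.
    """
    m = len(motif)
    return {
        i
        for i in range(len(text) - m + 1)
        if all(c == DONT_CARE or c == text[i + j] for j, c in enumerate(motif))
    }


def window_occurrences(motif, x, i, ell):
    """Occurrences of motif inside the window x[i-ell+1 .. i] (inclusive),
    reported as absolute positions in x."""
    start = i - ell + 1
    return {start + p for p in occurrences(motif, x[start:i + 1])}


def diff(motif, x, i, ell):
    """Symmetric difference of the occurrence sets of one motif in the
    windows ending at positions i-1 and i (both of length ell)."""
    prev = window_occurrences(motif, x, i - 1, ell)
    curr = window_occurrences(motif, x, i, ell)
    return prev ^ curr


x = "abcabdabc"
print(occurrences("ab.", x))
print(diff("ab.", x, 6, 6))
```

In this toy input the motif `ab.` matches both `abc` and `abd`, so it occurs three times in the whole sequence, whereas the specialised motif `abc` occurs only twice. Sliding the window of length 6 from `x[0..5]` to `x[1..6]` drops the occurrence at position 0, which is the single element of the symmetric difference for that step.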