Abstract: The context in which a substring appears is an important notion to iden-
tify – for example – its semantic meaning. However, existing definitions from
stringology fail to model the context explicitly. We introduce here xkcd -repeats,
a new family of repeats characterized by the number of different symbols at the
left and right of their occurrences. These repeats include as special extreme
cases the well-known classes of maximal and super-maximal repeats.
We give sufficient and necessary condition to bound their number linearly
in the size of the original string, and show an optimal algorithm that computes
them in linear time – given a suffix array –, independent of the size of the
alphabet, as well as two other algorithms that are faster in practice.
We extend this in two ways: first we show how to generalize the notion
of context. Instead of reducing it to the single symbol before and after an
occurrence, we propose the notion of unbounded context, while keeping linear
representation and computing time. Secondly, we provide a general framework
which allows to compute these (and other) repeats incrementally; opening the
door to streaming application.
0 Replies
Loading