Abstract: We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] in \(\tau \) intervals, \(\mathcal {I}=I_1,\ldots ,I_\tau \), the frequency-constrained substring complexity of x is the function \(f_{x,\mathcal {I}}(i,j)\) that maps i, j to the number of distinct substrings of length i of x occurring at least \(\alpha _j\) and at most \(\beta _j\) times in x, where \(I_j=[\alpha _j,\beta _j]\). We extend this notion as follows. For a string x, a dictionary \(\mathcal {D}\) of d strings (documents), and a partition of [d] in \(\tau \) intervals \(I_1,\ldots ,I_\tau \), we define a 2D array \(S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]\) as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least \(\alpha _j\) and at most \(\beta _j\) documents, where \(I_j=[\alpha _j,\beta _j]\). Array S can thus be seen as the distribution of the substring complexity of x into \(\tau \) document frequency classes. We show that after a linear-time preprocessing of \(\mathcal {D}\), for any x and any partition of [d] in \(\tau \) intervals given online, array S can be computed in near-optimal \(\mathcal {O}(|x| \tau \log \log d)\) time.
Loading