Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space

Published: 01 Jan 2024, Last Modified: 26 Jul 2025. Venue: IWOCA 2024. License: CC BY-SA 4.0
Abstract: A string w is said to be a minimal absent word (MAW) for a string S if w does not occur in S and every proper substring of w occurs in S. We focus on non-trivial MAWs, which are those of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size \(\varTheta(n)\), based on the directed acyclic word graph (DAWG), that can output the set \(\textsf{MAW}(S)\) of all MAWs for a given string S of length n in \(O(n + |\textsf{MAW}(S)|)\) time. In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output \(\textsf{MAW}(S)\) in \(O(|\textsf{MAW}(S)|)\) time with \(O(\textsf{e}_{\min})\) space, where \(|\textsf{MAW}(S)|\) denotes the cardinality of \(\textsf{MAW}(S)\) and \(\textsf{e}_{\min}\) denotes the minimum of the sizes of the CDAWGs for S and for its reversal \(S^R\). For any string of length n, it holds that \(\textsf{e}_{\min} < 2n\), and for highly repetitive strings \(\textsf{e}_{\min}\) can be sublinear (up to logarithmic) in n. We also show, via the CDAWG, that MAWs and their generalization, minimal rare words, are closely related to extended bispecial factors.
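To make the definition concrete, the following is a minimal brute-force sketch in Python that enumerates the non-trivial MAWs of a short string. It is not the CDAWG-based data structure of the paper: it stores every substring explicitly and is only meant for small inputs. The function name `nontrivial_maws` and the choice of the alphabet as the set of characters occurring in S are assumptions made for illustration. It relies on the observation that a word w of length at least 2 is a MAW exactly when w is absent from S while both its longest proper prefix and its longest proper suffix occur in S.

```python
def nontrivial_maws(s: str) -> set[str]:
    """Return all minimal absent words of s that have length at least 2.

    Naive illustration only: it materializes every substring of s,
    so time and space are far worse than an index-based approach.
    """
    n = len(s)
    # All non-empty substrings (factors) of s.
    factors = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    # Assumption: the alphabet is the set of characters occurring in s.
    alphabet = set(s)

    maws = set()
    for x in factors:
        for c in alphabet:
            w = x + c
            # w is a MAW iff w is absent while its longest proper prefix
            # (x, a factor by construction) and its longest proper
            # suffix (w[1:]) both occur in s.
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(nontrivial_maws("abaab")))
```

For S = abaab this yields the four words bb, aaa, bab, and aaba: each is absent from S, yet removing its first or its last character gives a substring of S.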