An Efficient Algorithm for Regular Expression Matching Using Variable-length-gram Inverted Index

Published: 01 Jan 2024, Last Modified: 06 Feb 2025, Venue: DASFAA (2) 2024, License: CC BY-SA 4.0
Abstract: Regular expression (regex) matching is widely used in applications such as code search, entity extraction, and intrusion detection, all of which require high matching efficiency. Traditional approaches use a finite state automaton to match the regex query against a text. Although they can employ filtering strategies to skip characters that cannot be part of any match, a large amount of content still has to be verified by the automaton. Recent methods use a positional inverted index based on q-grams (substrings of length q) to find all results for the regex query, which avoids the time-consuming automaton-based verification. However, indexing text positions with fixed-length substrings (i.e., q-grams) can cause frequently occurring q-grams to be used for regex matching, which limits matching efficiency. To this end, we employ a variable-length gram technique to boost the efficiency of index-based regex matching. First, we build the positional inverted index on variable-length grams to obtain a better balance between the number of grams and the number of gram occurrences in the text. Then, we design a data structure (VGgraph) based on variable-length grams to represent the regex query and propose a VGgraph-based matching algorithm that uses the variable-length-gram inverted index. Although computing the optimal VGgraph with the minimal matching cost is NP-hard, we propose a greedy algorithm that constructs a VGgraph achieving a \(\ln {n}\) approximation to the optimal VGgraph. Extensive experiments on real-world datasets demonstrate that the variable-length gram technique can significantly improve the efficiency of index-based regex matching.
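The fixed-length baseline mentioned in the abstract, a positional inverted index over q-grams, can be illustrated with a minimal sketch. This is not the paper's VGgraph algorithm and it handles only a literal substring rather than a full regex; the function names `build_positional_index` and `find_literal` are hypothetical and chosen for illustration only. It shows how match positions can be recovered from posting lists alone, without scanning the text with an automaton.

```python
from collections import defaultdict


def build_positional_index(text, q):
    """Map every q-gram of `text` to the ascending list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def find_literal(index, q, literal):
    """Return sorted start positions of `literal`, using only the index.

    Illustrative assumption: the query is a plain literal of length >= q.
    """
    if len(literal) < q:
        raise ValueError("literal must contain at least one q-gram")
    # Pick gram offsets 0, q, 2q, ... plus the final gram so that every
    # character of the literal is covered by at least one gram.
    offsets = list(range(0, len(literal) - q + 1, q))
    last = len(literal) - q
    if offsets[-1] != last:
        offsets.append(last)
    candidates = None
    for off in offsets:
        gram = literal[off:off + q]
        # Shift each posting back by the gram's offset inside the literal
        # to obtain the candidate start position of the whole literal.
        starts = {p - off for p in index.get(gram, ())}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)


text = "abcabcabd"
index = build_positional_index(text, 2)
print(find_literal(index, 2, "abc"))  # [0, 3]
print(len(index["ab"]))               # 3 postings for the frequent gram "ab"
```

In this toy text the gram `ab` has the longest posting list, and any query containing it has to process all of those postings. That is the cost the abstract attributes to fixed-length grams, and the motivation it gives for indexing variable-length grams instead.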