Efficient Index-Based Regular Expression Matching with Optimal Query Plan Tree

Published: 2023, Last Modified: 06 Feb 2025. Venue: DASFAA (1) 2023. License: CC BY-SA 4.0
Abstract: Matching a regular expression (regex) against a text arises in many applications, such as entity matching, protein sequence matching, and shell commands. Classical methods for regex matching usually rely on finite automata, which incur a high matching cost. Recent methods instead use the positional q-gram inverted index, one of the most widely used index schemes, and obtain all matching results directly from this index. The efficiency of these methods depends critically on the query plan tree, which is built from the query using heuristic rules, so they can become inefficient when an unsuitable rule is applied. To remedy this issue, this paper aims to build a good query plan tree with an efficiency guarantee. We propose a novel method that builds an optimal query plan tree, that is, one with the minimal expected matching cost, for index-based regex matching. Although computing an optimal query plan tree is NP-hard even under strong assumptions, we propose a pseudo-polynomial-time algorithm to build one. Finally, extensive experiments on real-world data sets show that our method outperforms state-of-the-art methods.
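To make the positional q-gram inverted index mentioned in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: it handles only a literal pattern, not a full regex, and it does not build a query plan tree. The function names `build_index` and `find_literal`, the choice q = 2, and the "rarest q-gram first" ordering are all assumptions made for illustration. The ordering is only a simple heuristic of the kind the abstract says existing methods rely on.

```python
from collections import defaultdict


def build_index(text, q=2):
    """Map each q-gram to the ascending list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def find_literal(index, pattern, q=2):
    """Return the start positions of `pattern`, using only the index.

    Illustrative assumption: the pattern is a plain string of length >= q.
    """
    if len(pattern) < q:
        raise ValueError("pattern must contain at least one q-gram")

    # Every q-gram of the pattern, paired with its offset inside the pattern.
    grams = [(pattern[k:k + q], k) for k in range(len(pattern) - q + 1)]

    # Heuristic order (an assumption, not the paper's optimal plan):
    # process the q-gram with the shortest position list first.
    grams.sort(key=lambda g: len(index.get(g[0], ())))

    # A q-gram found at position p with offset k implies a candidate
    # match starting at p - k. Intersect the candidates of all q-grams.
    first_gram, first_off = grams[0]
    candidates = {p - first_off for p in index.get(first_gram, ())}
    for gram, off in grams[1:]:
        if not candidates:
            break
        candidates &= {p - off for p in index.get(gram, ())}
    return sorted(candidates)
```

For example, `find_literal(build_index("abracadabra"), "abra")` returns `[0, 7]`. The order in which position lists are intersected does not change the result, only the amount of work done, which is the kind of cost a query plan is meant to control.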