Enhancing Multi-modal Regular Expression Synthesis via Large Language Models and Semantic Manipulations of Sub-expressions

Zipan Tang, Yixuan Yan, Rongchen Li, Hanze Dong, Haiming Chen, Hongyu Gao

Published: 01 Jan 2024, Last Modified: 13 May 2025, Venue: SETTA 2024, License: CC BY-SA 4.0

Abstract: Regular expressions (regexes) are widely used in real-world applications. Because regexes are difficult to comprehend and write, automatically synthesizing them has been an important research problem. However, current techniques face challenges such as weak generalization capability and severely limited support for extended regex features. In this paper, we address these challenges by combining Large Language Models (LLMs) with semantic manipulations of sub-expressions, and propose PowerSyn, a framework that synthesizes regexes from both natural language descriptions and examples and supports extended features. Specifically, we design a prompt suited to synthesizing regexes with LLMs and present novel algorithms for manipulating sub-expressions that take the matching relation into account and are guided by examples. The evaluation results demonstrate the significant efficacy of our approach.
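To make the idea of example guidance concrete, the following is a minimal Python sketch of how candidate regexes (for instance, ones proposed by an LLM) could be checked against positive and negative examples. It is not PowerSyn's algorithm: the function names `consistent` and `first_consistent`, the candidate list, and the example strings are all hypothetical, and the sketch only filters candidates rather than manipulating sub-expressions.

```python
import re


def consistent(pattern, positives, negatives):
    """Return True if `pattern` fully matches every positive example
    and no negative example. Invalid patterns count as inconsistent."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return False
    return (all(compiled.fullmatch(s) for s in positives)
            and not any(compiled.fullmatch(s) for s in negatives))


def first_consistent(candidates, positives, negatives):
    """Return the first candidate consistent with the examples, or None."""
    for pattern in candidates:
        if consistent(pattern, positives, negatives):
            return pattern
    return None


# Hypothetical candidates for "word characters containing both a digit
# and a lowercase letter"; the last one uses lookaheads, an extended feature.
candidates = [
    r"\d+",
    r"(?=.*[a-z])\w+",
    r"(?=\w*\d)(?=\w*[a-z])\w+",
]
positives = ["abc1", "x9y"]
negatives = ["abc", "123", ""]

best = first_consistent(candidates, positives, negatives)
```

Here `best` is the third candidate: the first rejects the positive examples, and the second wrongly accepts the negative example `abc`.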