Improving Complex SQL Generation for Text-to-SQL by Addressing Semantic Blind Spots in Pending SQL Components
Keywords: Large Language Models, Text-to-SQL, Reinforcement Learning, Group Relative Policy Optimization, Abstract Syntax Tree
Abstract: Recent advances in large language models (LLMs) have substantially improved Text-to-SQL performance. However, because these models generate output token by token, they suffer from a “semantic blind spot” with respect to pending SQL components, that is, the parts of the SQL query that have not yet been generated. Specifically, a language model cannot make effective use of the semantic information in these pending components while generating the final SQL query, which makes complex SQL statements particularly difficult to produce. To address this issue, we propose a novel thought process based on pre-generating SQL components, and we design a maximum connected subtree matching reward mechanism over the SQL abstract syntax tree (AST) to improve the accuracy of local component generation. Extensive experiments show that, at comparable model parameter scales, our training approach offers significant advantages and effectively improves the generation of complex SQL queries. Our method attains an execution accuracy (EX) of 65.78% on the BIRD-dev dataset and achieves state-of-the-art (SOTA) performance on the Spider-Syn dataset.
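The abstract does not define the maximum connected subtree matching reward, so the following is only an illustrative sketch of one way such a reward could be computed, not the authors' implementation. It assumes a hypothetical `Node` type for AST nodes, assumes children are compared position by position, and assumes the reward is normalized by the size of the gold AST; all three choices are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical AST node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(n: Node) -> int:
    """Number of nodes in the tree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def nodes(n: Node) -> Iterator[Node]:
    """Yield every node of the tree rooted at n (pre-order)."""
    yield n
    for c in n.children:
        yield from nodes(c)


def match_size(p: Node, g: Node) -> int:
    """Size of the connected matching region rooted at the pair (p, g).

    A child pair contributes only if its parents matched, so the matched
    region is always connected. Children are paired by position, which is
    a simplifying assumption.
    """
    if p.label != g.label:
        return 0
    return 1 + sum(match_size(pc, gc) for pc, gc in zip(p.children, g.children))


def subtree_reward(pred: Node, gold: Node) -> float:
    """Largest connected match over all root pairs, as a fraction of the gold AST."""
    best = max(match_size(p, g) for p in nodes(pred) for g in nodes(gold))
    return best / size(gold)


def query(threshold: str) -> Node:
    """Toy AST for: SELECT name FROM users WHERE age > <threshold>."""
    return Node("QUERY", [
        Node("SELECT", [Node("name")]),
        Node("FROM", [Node("users")]),
        Node("WHERE", [Node(">", [Node("age"), Node(threshold)])]),
    ])


gold = query("30")
pred = query("18")
# Only the literal differs, so 8 of the 9 gold nodes are matched.
print(subtree_reward(pred, gold))
```

In this toy example the prediction is wrong only in one literal, so a reward of this kind still credits the correctly generated local components instead of scoring the whole query as a failure.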
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23684