Abstract: The development of Spider (Yu et al., 2018), a large-scale dataset with complex programs and databases from several domains, has led to much progress in text-to-SQL semantic parsing. However, recent work has shown that models trained on Spider often struggle to generalize, even when faced with small perturbations of previously seen expressions. This is mainly due to the linguistic form of questions in Spider, which are overly specific, unnatural, and display limited variation. In this work, we use data augmentation to enhance the robustness of text-to-SQL parsers against natural language variations. Existing approaches either generate question reformulations via models trained on Spider or introduce only local changes. In contrast, we leverage the capabilities of large language models to generate more realistic and diverse questions. Using only a few prompts, we achieve a two-fold increase in the number of questions in Spider. Training on this augmented dataset yields substantial improvements on a range of evaluation sets, including robustness benchmarks and out-of-domain data.
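The abstract does not spell out the augmentation pipeline, so the sketch below is only a plausible illustration of prompt-based question reformulation: a few-shot prompt asks an LLM to rephrase a Spider question more naturally, and each accepted rewrite is paired with the original SQL query, doubling the training examples. The few-shot examples, the `complete` stub, and all names here are assumptions, not the authors' actual prompts or code.

```python
# Illustrative sketch of LLM-based question augmentation for text-to-SQL.
# All prompts and names are hypothetical; the paper's prompts are not
# given in the abstract.

FEW_SHOT = """Rewrite the question so it sounds natural, keeping the meaning.

Question: What are the names of all singers whose age is above 40?
Rewrite: Which singers are older than 40?

Question: Show the name and capacity of the stadium with the highest capacity.
Rewrite: What's the biggest stadium, and how many people does it hold?

Question: {question}
Rewrite:"""

def complete(prompt: str) -> str:
    """Placeholder for an LLM call (an API or a local model)."""
    raise NotImplementedError

def augment(question: str, sql: str, n: int = 2) -> list[tuple[str, str]]:
    """Sample n reformulations of a Spider question and pair each
    accepted rewrite with the original SQL query unchanged."""
    prompt = FEW_SHOT.format(question=question)
    rewrites = {complete(prompt).strip() for _ in range(n)}
    return [(rw, sql) for rw in rewrites if rw and rw != question]
```

Since the SQL query is reused verbatim, every rewrite that preserves the question's meaning yields a new (question, SQL) training pair at no annotation cost.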
Paper Type: long
Research Area: Semantics: Sentence-level Semantics, Textual Inference and Other areas
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.