SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing

Published: 10 Oct 2024, Last Modified: 28 Oct 2024 · TRL @ NeurIPS 2024 Poster · CC BY 4.0
Keywords: Text-to-SQL, Synthetic Data Generation, Low-Resource Scenarios, In-Domain, In-Context Learning, Fine-Tuning
TL;DR: We introduce SynQL, a novel method for generating diverse synthetic data for text-to-SQL parsing in in-domain, low-resource scenarios
Abstract: We address the challenge of generating high-quality data for text-to-SQL parsing in low-resource, in-domain scenarios. Although leveraging large language models (LLMs) with in-context learning often achieves the best results in research settings, it is frequently impractical for real-world applications. Fine-tuning smaller, domain-specific models therefore provides a viable alternative, but this approach is often constrained by the scarcity of training data. To overcome this, we introduce SynQL, a novel method for synthetic text-to-SQL data generation tailored to in-domain contexts. We demonstrate the effectiveness of SynQL on the KaggleDBQA benchmark, showing significant performance improvements over models fine-tuned on the original data. Additionally, we validate our method on the out-of-domain Spider dataset. We open-source the method and both synthetic datasets.
Submission Number: 60