Importance of Synthesizing High-quality Data for Text-to-SQL Parsing

Yiyun Zhao; Jiarong Jiang; Yiqun Hu; Wuwei Lan; Henghui Zhu; Anuj Chauhan; Alexander Hanbo Li; Lin Pan; Jun Wang; Chung-Wei Hang; Sheng Zhang; Mingwen Dong; Joseph Lilien; Patrick Ng; Zhiguo Wang; Vittorio Castelli; Bing Xiang

03 Oct 2022 (modified: 06 Jul 2025), NeurIPS 2022 SyntheticData4ML, Readers: Everyone

Keywords: Natural language processing, semantic parsing, data synthesis, data augmentation, text-to-sql

TL;DR: We propose a new data synthesis framework for text-to-SQL and design an intermediate representation to bridge SQL-to-NLQ generation, which further improves state-of-the-art performance on the Spider benchmark.

Abstract: There has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we examine existing synthesized datasets and find that state-of-the-art text-to-SQL algorithms do not improve further on popular benchmarks when trained with augmented synthetic data. We observe two shortcomings: illogical synthetic SQL queries caused by independent column sampling, and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, they show significant accuracy gains and achieve new state-of-the-art performance on Spider.
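
To make the idea of schema-distance-weighted column sampling concrete, here is a minimal sketch. It is not the authors' implementation: the toy schema, the function names, and the exponential `decay` weighting are all assumptions chosen for illustration. The sketch treats tables as nodes and foreign keys as edges, measures distance from an anchor table by breadth-first search, and gives columns in nearby tables a higher sampling weight so that sampled columns tend to be joinable and related.

```python
import random
from collections import deque

# Hypothetical toy schema (table -> columns); not taken from the paper.
SCHEMA = {
    "singer": ["singer_id", "name", "age"],
    "singer_in_concert": ["singer_id", "concert_id"],
    "concert": ["concert_id", "stadium_id", "year"],
    "stadium": ["stadium_id", "capacity"],
}

# Hypothetical foreign-key links, treated as undirected edges between tables.
FOREIGN_KEYS = [
    ("singer_in_concert", "singer"),
    ("singer_in_concert", "concert"),
    ("concert", "stadium"),
]


def table_distances(start, schema, edges):
    """Breadth-first search: number of foreign-key hops from `start` to each table."""
    adjacency = {table: set() for table in schema}
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def column_weights(anchor_table, schema, edges, decay=0.5):
    """Weight each column by decay ** distance(anchor_table, its table).

    The exponential decay is an assumed weighting, used only for illustration.
    Tables with no join path to the anchor are skipped entirely.
    """
    dist = table_distances(anchor_table, schema, edges)
    weights = {}
    for table, columns in schema.items():
        if table not in dist:
            continue
        for column in columns:
            weights[(table, column)] = decay ** dist[table]
    return weights


def sample_column(anchor_table, schema, edges, rng, decay=0.5):
    """Draw one (table, column) pair, favoring tables close to the anchor."""
    weights = column_weights(anchor_table, schema, edges, decay)
    candidates = list(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_column("singer", SCHEMA, FOREIGN_KEYS, rng))
```

With `singer` as the anchor, columns of `singer` get weight 1.0, while columns of `stadium`, three foreign-key hops away, get weight 0.125. A uniform sampler would treat all columns alike, which corresponds to the independent column sampling the abstract identifies as a source of illogical queries.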

Supplementary Material: zip
