Keywords: Text-to-SQL, LLM, Synthetic Data Generation, Databases
TL;DR: SQL-GEN is a new framework that generates synthetic training data to improve text-to-SQL systems for different SQL dialects. Additionally, SQL-GEN's MoE initialization method helps to unify multi-dialect capabilities in a single system.
Abstract: Text-to-SQL systems, which convert natural language queries into SQL programs, have seen significant progress with recent breakthroughs. However, this progress has been primarily for the SQLite dialect, and adapting Text-to-SQL systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions, along with the high cost of collecting and curating dialect-specific training data. To this end, we introduce SQL-GEN, a framework for generating high-quality synthetic data for any dialect, guided by dialect-specific tutorials. We demonstrate the effectiveness of SQL-GEN in creating training data that significantly improves downstream Text-to-SQL performance for other dialects: it improves execution accuracy by up to 20% over previous methods and narrows the gap with large-scale human-annotated data on unseen, real-world multi-dialect benchmarks. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of up to 5.6%. Towards unifying multi-dialect capability in a single system, we also introduce a novel Mixture-of-Experts (MoE) initialization method that integrates dialect-specific models by merging their self-attention layers and initializing the gates with dialect-specific keywords, yielding a single unified model adept at multiple SQL dialects. By leveraging the shared core features of multiple dialect-specific models, our MoE demonstrates superior performance compared with models trained on individual dialects alone.
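To make the MoE initialization described in the abstract concrete, here is a minimal PyTorch sketch. It is an interpretation under stated assumptions, not the authors' implementation: we assume "merging self-attention layers" means parameter averaging across the dialect-specific models, and that each gate row is seeded with a mean embedding of that dialect's keywords. All names (merge_attention, DialectMoELayer, keyword_embeds) are hypothetical.

```python
# Hedged sketch of the MoE initialization: merged self-attention +
# keyword-seeded router gates. Assumes parameter averaging as the merge.
import torch
import torch.nn as nn

def merge_attention(attn_layers):
    """Average the self-attention weights of the dialect-specific models
    into one shared attention module (one plausible reading of 'merging').
    Reuses the first module's structure and overwrites its weights."""
    merged = attn_layers[0]
    state = {k: torch.stack([l.state_dict()[k] for l in attn_layers]).mean(0)
             for k in merged.state_dict()}
    merged.load_state_dict(state)
    return merged

class DialectMoELayer(nn.Module):
    """One block with a shared (merged) attention, one FFN expert per
    dialect, and a gate initialized from dialect-keyword embeddings."""
    def __init__(self, shared_attn, expert_ffns, keyword_embeds):
        super().__init__()
        self.attn = shared_attn                    # merged self-attention
        self.experts = nn.ModuleList(expert_ffns)  # one FFN per dialect
        d_model = keyword_embeds.shape[-1]
        self.gate = nn.Linear(d_model, len(expert_ffns), bias=False)
        with torch.no_grad():
            # Row i of the gate = mean embedding of dialect i's keywords,
            # so hidden states near those keywords route to expert i.
            self.gate.weight.copy_(keyword_embeds)

    def forward(self, x):
        h, _ = self.attn(x, x, x)  # shared attention over the input
        weights = torch.softmax(self.gate(h), dim=-1)          # (B, T, E)
        out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (out * weights.unsqueeze(-2)).sum(-1)            # (B, T, D)

if __name__ == "__main__":
    d_model, n_dialects = 64, 3
    attn_layers = [nn.MultiheadAttention(d_model, 4, batch_first=True)
                   for _ in range(n_dialects)]
    ffns = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_dialects)]
    # Hypothetical stand-in for mean keyword embeddings, (n_dialects, d_model).
    keyword_embeds = torch.randn(n_dialects, d_model)
    layer = DialectMoELayer(merge_attention(attn_layers), ffns, keyword_embeds)
    print(layer(torch.randn(2, 10, d_model)).shape)  # torch.Size([2, 10, 64])
```

A production router would typically route per token with top-k expert selection; the dense softmax here keeps the sketch short while preserving the keyword-seeded gating idea.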
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10433