Keywords: Text-to-SQL, LLM, Synthetic Data Generation, Databases
TL;DR: SQL-GEN is a new framework that generates synthetic training data to improve text-to-SQL systems for different SQL dialects. Additionally, SQL-GEN's MoE initialization method helps to unify multi-dialect capabilities in a single system.
Abstract: Text-to-SQL systems, which convert natural language queries into SQL programs, have seen significant progress with recent breakthroughs. However, this progress has been primarily for the SQLite dialect, and adapting Text-to-SQL systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions, along with the high cost of collecting and curating dialect-specific training data. To this end, we introduce SQL-GEN, a framework for generating high-quality synthetic data for any dialect, guided by dialect-specific tutorials. We demonstrate the effectiveness of SQL-GEN in creating training data that significantly improves downstream Text-to-SQL performance for other dialects: it improves execution accuracy by up to 20% over previous methods and narrows the gap with large-scale human-annotated data on unseen, real-world multi-dialect benchmarks. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of up to 5.6%. Towards unifying multi-dialect capability in a single system, we also introduce a novel Mixture-of-Experts (MoE) initialization method that integrates dialect-specific models by merging their self-attention layers and initializing the gates with dialect-specific keywords, yielding a single unified model adept at multiple SQL dialects. By leveraging the shared core features of multiple dialect-specific models, our MoE demonstrates superior performance compared with models trained on individual dialects alone.
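To make the MoE initialization described in the abstract concrete, here is a minimal PyTorch sketch. It is an interpretation under stated assumptions, not the authors' implementation: we assume "merging self-attention layers" means parameter averaging across the dialect-specific models, and that each gate row is seeded with a mean embedding of that dialect's keywords. All names (merge_attention, DialectMoELayer, keyword_embeds) are hypothetical.

```python
# Hedged sketch of the MoE initialization: merged self-attention +
# keyword-seeded router gates. Assumes parameter averaging as the merge.
import torch
import torch.nn as nn

def merge_attention(attn_layers):
    """Average the self-attention weights of the dialect-specific models
    into one shared attention module (one plausible reading of 'merging').
    Reuses the first module's structure and overwrites its weights."""
    merged = attn_layers[0]
    state = {k: torch.stack([l.state_dict()[k] for l in attn_layers]).mean(0)
             for k in merged.state_dict()}
    merged.load_state_dict(state)
    return merged

class DialectMoELayer(nn.Module):
    """One block with a shared (merged) attention, one FFN expert per
    dialect, and a gate initialized from dialect-keyword embeddings."""
    def __init__(self, shared_attn, expert_ffns, keyword_embeds):
        super().__init__()
        self.attn = shared_attn                    # merged self-attention
        self.experts = nn.ModuleList(expert_ffns)  # one FFN per dialect
        d_model = keyword_embeds.shape[-1]
        self.gate = nn.Linear(d_model, len(expert_ffns), bias=False)
        with torch.no_grad():
            # Row i of the gate = mean embedding of dialect i's keywords,
            # so hidden states near those keywords route to expert i.
            self.gate.weight.copy_(keyword_embeds)

    def forward(self, x):
        h, _ = self.attn(x, x, x)  # shared attention over the input
        weights = torch.softmax(self.gate(h), dim=-1)          # (B, T, E)
        out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (out * weights.unsqueeze(-2)).sum(-1)            # (B, T, D)

if __name__ == "__main__":
    d_model, n_dialects = 64, 3
    attn_layers = [nn.MultiheadAttention(d_model, 4, batch_first=True)
                   for _ in range(n_dialects)]
    ffns = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_dialects)]
    # Hypothetical stand-in for mean keyword embeddings, (n_dialects, d_model).
    keyword_embeds = torch.randn(n_dialects, d_model)
    layer = DialectMoELayer(merge_attention(attn_layers), ffns, keyword_embeds)
    print(layer(torch.randn(2, 10, d_model)).shape)  # torch.Size([2, 10, 64])
```

A production router would typically route per token with top-k expert selection; the dense softmax here keeps the sketch short while preserving the keyword-seeded gating idea.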
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10433