Keywords: text-to-SQL, large language models, benchmark, domain
TL;DR: We introduce an automated benchmark-transformation technique that produces high-quality text-to-SQL benchmarks for specialized domains without requiring domain expertise, and show that the resulting benchmarks can be used to substantially improve text-to-SQL accuracy in those domains.
Abstract: Text-to-SQL (Text2SQL) systems democratize access to structured databases by enabling users to retrieve data via natural language. Public Text2SQL benchmarks such as Spider and BIRD, which were introduced to measure the accuracy of Text2SQL systems, have been instrumental in driving their continual improvement. However, a Text2SQL system that performs well on a public benchmark may perform poorly when incorporated into industry applications that access proprietary databases with domain-specific vocabulary, rules, and schemas. While prompt-tuning and model fine-tuning can significantly improve the performance of LLM-based Text2SQL systems, they require benchmarks tailored to the domain of interest, and producing a domain-specific benchmark typically requires a great deal of human expertise and labor.
We introduce an automated method that generates domain-specific Text2SQL benchmarks by translating ground-truth question-SQL pairs from existing public datasets to a target domain. We apply our technique to the Spider and BIRD benchmarks to produce benchmarks for two target domains: asset management and patient health care. For both domains, the accuracy of various popular Text2SQL systems is typically less than half of what it is on Spider or BIRD. Fine-tuning LLMs on our generated datasets leads to substantial accuracy improvements: the fine-tuned models can exceed frontier models by up to 35% for asset management and up to 11% for health care. Delving into why our accuracy improvements are domain-dependent, we introduce a dataset distance metric that qualitatively correlates with the degree of improvement. In essence, our benchmark-transformation technique leverages the substantial human effort already expended to produce existing public benchmarks, obviating the need to repeat that effort for each domain. We plan to make our transformed benchmarks available to the research community.
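The abstract does not spell out the translation pipeline, so the Python sketch below shows one plausible shape for it under our own assumptions: an LLM re-grounds each question-SQL pair in the target domain's schema, and a cheap executability filter discards translations whose SQL does not run. The function names, prompt wording, and the sqlite3-based validity check are all illustrative and are not the paper's actual implementation.

# Minimal sketch of the benchmark-translation idea, as we read it from the
# abstract. Everything here (translate_pair, executes_on, the prompt text)
# is an assumption for illustration, not the authors' method.

import json
import sqlite3


def translate_pair(question: str, sql: str, target_schema: str, llm) -> dict:
    """Ask an LLM (any callable: prompt string -> response string) to
    re-ground a source question-SQL pair in the target domain's schema."""
    prompt = (
        "Rewrite the following question and SQL query so that they refer to "
        "the target database schema below, preserving the query's logical "
        "structure (joins, aggregations, filters).\n\n"
        f"Target schema:\n{target_schema}\n\n"
        f"Question: {question}\nSQL: {sql}\n\n"
        'Respond as JSON: {"question": ..., "sql": ...}'
    )
    return json.loads(llm(prompt))


def executes_on(sql: str, schema_ddl: str) -> bool:
    """Cheap validity filter: the translated SQL must at least execute
    against an empty in-memory instance of the target schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # create the target tables
        conn.execute(sql)               # raises sqlite3.Error if invalid
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

Usage would be, for example, pair = translate_pair(q, sql, schema_ddl, my_llm) followed by executes_on(pair["sql"], schema_ddl) to keep only runnable translations; any stronger guarantees (semantic equivalence to the source query, domain plausibility of the question) would need checks beyond this sketch.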
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 20738