Data-Centric Text-to-SQL with Large Language Models

Zezhou Huang; Shuo Zhang; Kechen Liu; Eugene Wu

Published: 10 Oct 2024, Last Modified: 24 Oct 2024. Venue: TRL @ NeurIPS 2024 (Poster). License: CC BY 4.0

Keywords: Text-to-SQL, data centric, large language models

Abstract: Text-to-SQL is crucial for enabling non-technical users to access data, and large language models have significantly improved its performance. However, recent frameworks are largely Query-Centric: they focus on improving models' ability to translate natural language into SQL queries. Despite these advancements, real-world challenges, especially messy and large datasets, remain a major bottleneck. Our case studies reveal that 11-37% of the ground truth answers in the BIRD benchmark are incorrect due to data quality issues (duplication, disguised missing values, data type issues, and inconsistent values). To address this, we propose a Data-Centric Text-to-SQL framework that preprocesses and cleans data offline, builds a relationship graph between tables, and incorporates business logic. This allows LLM agents to efficiently retrieve relevant tables and details at query time, significantly improving accuracy. Our experiments show that this approach outperforms human-provided ground truth answers on the BIRD benchmark by up to 33.89%.
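
To make two of the data quality issues named in the abstract concrete (duplication and disguised missing values), here is a minimal sketch of an offline cleaning pass over a SQLite table. It is an illustration under stated assumptions, not the authors' implementation: the function name `clean_table`, the sentinel list `DISGUISED_NULLS`, and the `customers` table are all hypothetical.

```python
import sqlite3

# Hypothetical sentinel strings treated as disguised missing values (an assumption,
# not a list taken from the paper).
DISGUISED_NULLS = ("", "N/A", "n/a", "NULL", "unknown", "-")


def clean_table(conn, table, columns):
    """Offline cleaning sketch: map sentinel strings to NULL, then drop exact duplicates."""
    cur = conn.cursor()
    placeholders = ", ".join("?" for _ in DISGUISED_NULLS)
    for col in columns:
        # Replace disguised missing values with a real NULL.
        cur.execute(
            f'UPDATE "{table}" SET "{col}" = NULL '
            f'WHERE TRIM("{col}") IN ({placeholders})',
            DISGUISED_NULLS,
        )
    # Keep the first row (lowest rowid) of each distinct combination of values.
    col_list = ", ".join(f'"{c}"' for c in columns)
    cur.execute(
        f'DELETE FROM "{table}" WHERE rowid NOT IN '
        f'(SELECT MIN(rowid) FROM "{table}" GROUP BY {col_list})'
    )
    conn.commit()


def demo():
    """Count rows with a known city before and after cleaning a toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("Ada", "Paris"), ("Ada", "Paris"), ("Bob", "N/A"), ("Cy", "Oslo")],
    )
    query = "SELECT COUNT(*) FROM customers WHERE city IS NOT NULL"
    before = conn.execute(query).fetchone()[0]
    clean_table(conn, "customers", ["name", "city"])
    after = conn.execute(query).fetchone()[0]
    conn.close()
    return before, after
```

In this toy table, a question such as "how many customers have a known city?" is overcounted on the raw data, because the duplicate row and the `N/A` placeholder both pass the `IS NOT NULL` filter. This is the kind of error a correct SQL query can still produce on uncleaned data.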

Submission Number: 24
