Keywords: enterprise text-to-sql, benchmarking
Abstract: Existing text-to-SQL benchmarks have largely been constructed from web tables with human-generated question-SQL pairs. LLMs typically show strong results on these benchmarks, leading to a belief that LLMs are effective at text-to-SQL tasks. However, how these results transfer to enterprise settings is unclear because tables in enterprise databases might differ substantially from web tables in structure and content. To address this gap, we introduce a new dataset BEAVER, the first *private enterprise* text-to-SQL benchmark. This dataset includes natural language queries and SQL statements collected from real query logs. Experimental results show that *off-the-shelf* LLMs struggle with this dataset. We identify three main reasons for the poor performance: (1) enterprise table schemas are more intricate than those found in public datasets, making SQL generation inherently more challenging; (2) business-oriented queries tend to be more complex, often involving multi-table joins, aggregations, and nested queries; (3) public LLMs cannot train on private enterprise data warehouses that are not publicly accessible, and therefore it is difficult for the model to learn to solve (1) and (2). We believe BEAVER will facilitate future research in building text-to-SQL systems that perform better in enterprise settings.
Include In Proceedings: No
Submission Number: 13