Text-to-SQL Benchmarks for Enterprise Realities: Massive Scopes, Complex Schemas, and Scattered Knowledge
Keywords: Large Language Model, Text-to-SQL Benchmark, Code Generation, NLP Application
Abstract: Existing Text-to-SQL benchmarks remain overly idealized and differ substantially from enterprise scenarios, which require retrieving tables from massive query scopes, interpreting complex schemas, and locating knowledge scattered across large collections of documents. To address these gaps, we present two enterprise benchmarks, BIRD-Ent and Spider-Ent, constructed by applying a cost-effective refinement framework to their academic counterparts (BIRD and Spider), together with a new task paradigm, Dual-Retrieval-Augmented-Generation (DRAG) Text-to-SQL, which formalizes the dual retrieval of table schemas and knowledge documents prior to SQL generation. Our benchmarks exhibit three defining characteristics of enterprise settings: massive query scopes spanning over 4,000 columns, complex schemas with domain-specific and heavily abbreviated table and column names, and knowledge scattered across enterprise-style documents totaling 1.5M tokens. These properties make the benchmarks substantially more realistic and challenging than existing ones. Evaluation of several state-of-the-art large language models (LLMs) reveals a sharp performance drop, with execution accuracy (EX) of only 39.1 on BIRD-Ent and 60.5 on Spider-Ent, underscoring the gap between academic performance and enterprise requirements. By providing a rigorous and discriminative testbed under the DRAG Text-to-SQL paradigm, our benchmarks offer a valuable resource for advancing research toward Text-to-SQL systems that are reliable and deployable in real-world enterprise environments.
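To make the DRAG paradigm concrete, below is a minimal sketch of one plausible dual-retrieval pipeline: table schemas and knowledge documents are retrieved independently for a question, and SQL generation is conditioned on both. The toy keyword retriever, the prompt layout, and the stubbed llm_complete are illustrative assumptions and do not reflect the authors' actual implementation.

# A minimal, self-contained sketch of the Dual-Retrieval-Augmented-Generation
# (DRAG) Text-to-SQL workflow described in the abstract: retrieve schemas and
# knowledge documents first, then generate SQL conditioned on both. Everything
# here (the keyword retriever, prompt layout, and the stubbed LLM call) is a
# hypothetical illustration, not the authors' pipeline.

def keyword_retrieve(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Rank corpus entries by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda t: -len(q_words & set(t.lower().split())))
    return scored[:top_k]

def drag_text_to_sql(question, schemas, documents, llm_complete,
                     k_schemas=3, k_docs=3):
    # Step 1: retrieve candidate table schemas from the massive query scope.
    hit_schemas = keyword_retrieve(question, schemas, k_schemas)
    # Step 2: retrieve scattered domain knowledge (e.g., abbreviation glossaries).
    hit_docs = keyword_retrieve(question, documents, k_docs)
    # Step 3: generate SQL conditioned on both retrieved sources.
    prompt = (
        "Table schemas:\n" + "\n".join(hit_schemas)
        + "\n\nDomain knowledge:\n" + "\n".join(hit_docs)
        + "\n\nQuestion: " + question + "\nSQL:"
    )
    return llm_complete(prompt)

if __name__ == "__main__":
    # Tiny made-up corpora with enterprise-style abbreviated names.
    schemas = ["cust_acct(cid, acct_bal, rgn_cd)", "txn_log(tid, cid, amt, ts)"]
    docs = ["rgn_cd: two-letter region code", "acct_bal is stored in cents"]
    fake_llm = lambda prompt: "SELECT cid FROM cust_acct WHERE rgn_cd = 'US';"
    print(drag_text_to_sql("Which customers are in the US region?",
                           schemas, docs, fake_llm))

In a realistic setting, each retriever would instead be a dense or hybrid index over the 4,000+ columns and the 1.5M-token document collection the abstract describes; the point of the sketch is only the two-stage retrieval preceding generation.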
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8734