Keywords: Text2SQL, Multi-Agent, Semantic Validation, Large Language Models, Benchmark Curation, Gold Errors
Abstract: While Large Language Models have significantly advanced Text2SQL generation, a critical semantic gap persists where syntactically valid queries often misinterpret user intent. To mitigate this challenge, we propose GBV-SQL, a novel multi-agent framework that introduces Guided Generation with SQL2Text Back-translation Validation. This mechanism uses a specialized agent to translate the generated SQL back into natural language, which verifies its logical alignment with the original question. Critically, our investigation reveals that current evaluation is undermined by a systemic issue: the poor quality of the benchmarks themselves. We introduce a formal typology for "Gold Errors", which are pervasive flaws in the ground-truth data, and demonstrate how they obscure true model performance. On the challenging BIRD benchmark, GBV-SQL achieves 63.23% execution accuracy, a 5.8% absolute improvement. After removing flawed examples from Spider and repairing flawed examples in BIRD, GBV-SQL achieves 96.5% (dev) and 97.6% (test) execution accuracy on Spider, and 90.42% on the corrected BIRD dataset. Our work offers both a robust framework for semantic validation and a critical perspective on benchmark integrity, highlighting the need for more rigorous dataset curation.
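The execution accuracy figures quoted in the abstract can be read as the fraction of questions whose predicted SQL returns the same result as the gold SQL. The following is a minimal sketch under that assumption, not the official Spider or BIRD evaluation script, which may differ in details such as row ordering or duplicate handling. The function names, the `singer` table, and the demo queries are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


def build_demo_db():
    """Create a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 30), ("Bob", 45), ("Cy", 52)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    pairs = [
        # Different SQL text, same result: counted as correct.
        ("SELECT name FROM singer WHERE age > 40",
         "SELECT name FROM singer WHERE age >= 45"),
        # Different result: counted as a miss.
        ("SELECT name FROM singer WHERE age > 50",
         "SELECT name FROM singer WHERE age > 40"),
        # Predicted query does not execute: counted as a miss.
        ("SELECT nme FROM singer",
         "SELECT name FROM singer"),
    ]
    print(execution_accuracy(conn, pairs))
```

Under this metric the gold query is treated as ground truth, so if the gold query itself is wrong, a correct prediction is scored as a miss. That is one way flawed ground-truth data can obscure true model performance.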
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: NLP Applications, Machine Learning for NLP, LLM Agents
Contribution Types: NLP engineering experiment
Languages Studied: English, SQL
Submission Number: 2247