Keywords: Text-to-SQL, Dynamic Interaction, Data Synthesis, Benchmark
Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized Text-to-SQL parsing, achieving remarkable success in static, single-turn query generation. However, a significant disparity remains between these academic benchmarks and real-world utility. In practical applications, such as financial auditing or business analytics, user intents are rarely static; they evolve dynamically through iterative refinement, necessitating not just information retrieval (SELECT) but continuous state manipulation (INSERT, UPDATE, DELETE). To bridge this gap, we introduce DySQL-Bench, a novel benchmark designed to rigorously evaluate LLMs within a dynamic interaction framework. Unlike benchmarks built through manual curation, DySQL-Bench employs a two-stage automated synthesis pipeline: raw relational schemas are first transformed into hierarchical logic trees to generate user-database interactions, and the results then pass through a rigorous verify-and-refine protocol that ensures 100\% correctness via human expert validation. We further propose an interactive evaluation environment implementing a triadic workflow among an LLM-simulated user, the agent under test, and an executable database system. Spanning 13 diverse domains with 1,072 complex tasks, our experiments reveal that even powerful current models struggle in this realistic setting. Notably, GPT-4o achieves only 58.34\% overall accuracy and a meager 23.81\% on the strict Pass^5 metric, highlighting the substantial challenges DySQL-Bench poses for the future of database agents.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, automatic creation and evaluation of language resources
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 3652