Beyond Text-to-SQL: Can LLMs Really Debug Enterprise SQL?

02 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: SWE, LLM, benchmark, SQL, bug fixing, agent
Abstract: SQL is at the core of data analysis and engineering across industries, powering large-scale workflows for data extraction, transformation, and loading. In enterprise-level scenarios, however, it is challenging to generate fully correct SQL code in a single attempt, even for experienced developers or advanced \ttsql LLMs. Multiple rounds of debugging are usually required, yet LLMs often get lost in multi-turn correction. To address this gap, we introduce \ourbench, the first benchmark designed for enterprise-level SQL reasoning and debugging. Our benchmark is built on two key innovations: (1) an automated construction workflow that uses reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an \textbf{execution-free evaluation framework} tailored to enterprise settings, providing fast, accurate, and resource-efficient assessment. \ourbench comprises 469 \ourbenchsyn queries featuring syntax errors with explicit error messages and 516 \ourbenchsem queries targeting semantic errors, where the SQL executes but fails to meet the stated requirement. The queries are substantially complex, averaging over 140 lines, with highly complex abstract syntax trees (average width > 11, depth > 8.7). We evaluate nearly 30 LLMs on \ourbench. Even state-of-the-art reasoning models struggle: Claude-4-Sonnet achieves only 36.46\% success on \ourbenchsyn and 32.17\% on \ourbenchsem, and most models fail to reach 20\% success, underscoring the significant gap between current LLM capabilities and the demands of enterprise SQL debugging. To bridge this gap, we systematically \textbf{explore four potential solution strategies and conduct extensive experiments} to evaluate and compare their effectiveness. Our experiments not only highlight these challenges but also identify effective strategies for advancing SQL debugging with LLMs.
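To make the two error categories concrete, below is a minimal, hypothetical sketch of the kind of bug injection the construction workflow performs. The sample query, the mutation rules, and the helper names (`inject_syntax_bug`, `inject_semantic_bug`) are illustrative assumptions, not the benchmark's actual pipeline.

```python
import re

# Hypothetical illustration of the two bug categories described in the
# abstract; the benchmark's real reverse-engineering rules are not shown here.
CORRECT_SQL = """
SELECT region, SUM(revenue) AS total_revenue
FROM sales
WHERE order_date >= '2024-01-01'
GROUP BY region
HAVING SUM(revenue) > 10000
"""

def inject_syntax_bug(sql: str) -> str:
    """Mutate the query so it no longer parses (drop the BY in GROUP BY).
    A database engine rejects this with an explicit error message."""
    return sql.replace("GROUP BY", "GROUP", 1)

def inject_semantic_bug(sql: str) -> str:
    """Mutate the query so it still parses and runs but no longer meets
    the requirement (aggregate the wrong column in the SELECT list)."""
    return re.sub(r"SUM\(revenue\)", "SUM(quantity)", sql, count=1)

if __name__ == "__main__":
    print(inject_syntax_bug(CORRECT_SQL))    # syntax split: parser error
    print(inject_semantic_bug(CORRECT_SQL))  # semantic split: silent wrong result
```

The first mutant is rejected by the parser with an explicit error message (the \ourbenchsyn setting), while the second executes successfully but silently violates the requirement (the \ourbenchsem setting).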
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 916