SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
Venue: NeurIPS 2025 poster
Readers: Everyone
License: CC BY-SA 4.0
Keywords: text-to-SQL, LLM, SQL issues
Abstract: Resolving complex SQL issues remains a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce **BIRD-CRITIC**, a new SQL issue debugging benchmark comprising 530 carefully curated PostgreSQL tasks (**BIRD-CRITIC-PG**) and 570 multi-dialect tasks (**BIRD-CRITIC-Multi**), which are distilled from authentic user issues and replayed within new environments to facilitate rigorous and contamination-free evaluation. Baseline evaluations on BIRD-CRITIC underscore the task's complexity, with the leading reasoning model **O3-Mini** achieving a success rate of only 38.87% on **BIRD-CRITIC-PG** and 33.33% on **BIRD-CRITIC-Multi**. Meanwhile, advancing open-source models for database tasks is crucial, as they can empower local development while safeguarding data privacy. Therefore, we present **Six-Gym** (**S**ql-f**IX**-Gym), a training environment for elevating the capabilities of open-source models specifically for SQL issue debugging. This environment leverages the **SQL-Rewind** strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods leave substantial supervisory signals unexplored. We further propose *f*-Plan Boosting, which automatically extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, **BIRD-Fixer**. Based on Qwen-2.5-Coder-14B, **BIRD-Fixer** raises the success rate to 38.11% on **BIRD-CRITIC-PG** and 29.65% on **BIRD-CRITIC-Multi**, surpassing many leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, and marking a significant step toward democratizing sophisticated SQL-debugging capabilities for both research and industry.
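To make the reported percentages concrete, the short sketch below converts them into approximate task counts. It assumes that each success rate is simply solved tasks divided by benchmark size, which the abstract does not state explicitly. The names `REPORTED`, `implied_solved`, and `implied_counts` are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract: (benchmark size, success rate in percent).
REPORTED = {
    ("O3-Mini", "BIRD-CRITIC-PG"): (530, 38.87),
    ("O3-Mini", "BIRD-CRITIC-Multi"): (570, 33.33),
    ("BIRD-Fixer", "BIRD-CRITIC-PG"): (530, 38.11),
    ("BIRD-Fixer", "BIRD-CRITIC-Multi"): (570, 29.65),
}


def implied_solved(total: int, rate_percent: float) -> int:
    """Nearest whole number of solved tasks implied by a percentage.

    Assumption: success rate = solved / total, rounded to two decimals.
    """
    return round(total * rate_percent / 100)


def implied_counts() -> dict:
    """Map each (model, benchmark) pair to its implied solved-task count."""
    return {
        key: implied_solved(total, rate)
        for key, (total, rate) in REPORTED.items()
    }


if __name__ == "__main__":
    for (model, bench), solved in implied_counts().items():
        total, rate = REPORTED[(model, bench)]
        print(f"{model} on {bench}: about {solved}/{total} tasks ({rate}%)")
```

Under that assumption, the rates correspond to roughly 206 of 530 and 190 of 570 tasks for O3-Mini, and 202 of 530 and 169 of 570 tasks for BIRD-Fixer, so the gap on BIRD-CRITIC-PG amounts to about four tasks.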
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 21257