LAIA-SQL: Enhancing Natural Language to SQL Generation in Multi-Table QA via Task Decomposition and Keyword Extraction
Keywords: Natural Language Understanding, Text-to-SQL, Multi-Table QA
Abstract: Natural Language to SQL (NL2SQL) provides an effective solution for multi-table question answering (Table QA), automating data retrieval by transforming simple user queries into SQL commands. It enhances data accessibility and decision-making processes across various industries. Large Language Model (LLM)-based NL2SQL methods have been shown to outperform rule-based and neural network-based approaches. However, existing LLM-based NL2SQL approaches still face challenges such as inaccurate interpretation of user questions, slow retrieval, erroneous SQL generation, and high operational costs. Because there is no dataset specifically designed to evaluate natural language understanding (NLU) in NL2SQL tasks, and no model optimized for understanding user questions in Table QA, we introduce LAIA-NLU, a novel dataset that dissects NLU into task decomposition and keyword extraction. LAIA-NLU contains 1,500 high-quality QA pairs created through manual review. Using this dataset, we developed LAIA-NLUer, a model capable of effectively interpreting user intent in table-based queries. To further improve NL2SQL performance in terms of speed, cost, and accuracy, we also present LAIA-SQL, a retrieval-augmented NL2SQL framework. Experimental results show that LAIA-SQL outperforms state-of-the-art models, improving accuracy to 67.28% on the BIRD dataset while reducing runtime by 52.4% and operational costs by 97%. These improvements demonstrate the potential of our approach to advance multi-table data retrieval and analysis. Our code, dataset, and model will be made publicly available to encourage further research in this field.
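The two-stage NLU described in the abstract (task decomposition plus keyword extraction) feeding a retrieval-augmented SQL generation step can be pictured roughly as in the sketch below. This is a minimal illustration only, not the authors' LAIA-SQL implementation: `call_llm`, `retrieve_schema`, and the prompt wording are hypothetical placeholders.

```python
# Minimal sketch, assuming a generic LLM completion callable and a schema
# retriever; every name and prompt here is a hypothetical placeholder,
# not the LAIA-SQL codebase.
from typing import Callable, List

def decompose_question(question: str, call_llm: Callable[[str], str]) -> List[str]:
    """Task decomposition: split a multi-table question into ordered sub-tasks."""
    prompt = "Break this question into minimal sub-tasks, one per line:\n" + question
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def extract_keywords(question: str, call_llm: Callable[[str], str]) -> List[str]:
    """Keyword extraction: surface entities/values likely to map to tables, columns, or literals."""
    prompt = "List the key entities and values in this question, comma-separated:\n" + question
    return [kw.strip() for kw in call_llm(prompt).split(",") if kw.strip()]

def answer_with_sql(question: str,
                    retrieve_schema: Callable[[List[str]], str],
                    call_llm: Callable[[str], str]) -> str:
    """NLU first (decomposition + keywords), then retrieval-augmented SQL generation."""
    sub_tasks = decompose_question(question, call_llm)
    keywords = extract_keywords(question, call_llm)
    schema_snippet = retrieve_schema(keywords)  # narrow the schema to relevant tables/columns
    prompt = (
        "Schema:\n" + schema_snippet + "\n\n"
        "Question: " + question + "\n"
        "Sub-tasks: " + "; ".join(sub_tasks) + "\n"
        "Write one SQL query that answers the question."
    )
    return call_llm(prompt)
```

Under this framing, a dataset like LAIA-NLU would supervise the first two steps (decomposition and keyword extraction), while the retrieval step is what the abstract credits for the reported speed and cost reductions.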
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10678