Keywords: Large Language Models, Text-to-SQL, Schema Linking
Abstract: Generating SQL from user queries (text-to-SQL) is a long-standing challenge that is typically tackled as a sequential pipeline, in which the accuracy of the initial schema linking significantly affects the quality of the subsequent SQL generation. However, existing models focus largely on SQL generation and pay less attention to schema linking, which can lead to missing or redundant linked schema elements and, in turn, suboptimal SQL generation performance. The underlying reason is that schema linking is not a simple selection problem but a \textbf{Knapsack problem}: it must consider both the \textit{value} of each schema element (the risk of omitting important information) and its \textit{weight} (the cost of including redundant information). Motivated by this, we provide two tailored schema-linking (SL) benchmarks for training SL agents, along with two tailored metrics that evaluate missing and redundant schema linking, respectively. In this paper, we propose the \textbf{\underline{K}n\underline{a}psack \underline{S}chema \underline{L}inking \underline{A}gent (KaSLA)}, which links the most valuable and least redundant subset of schema elements for both tables and columns. KaSLA introduces an importance score function to predict the importance of each schema element, and uses this score to estimate each element's value and weight. Then, after estimating the capacity (the maximum total weight the knapsack can hold) for a given user query from historical SQL records, KaSLA employs efficient dynamic programming to select the most valuable schema element set within the estimated capacity. Extensive experiments on two benchmark datasets demonstrate the superior performance of KaSLA over 12 state-of-the-art baselines. In particular, on the popular and challenging BIRD benchmark, KaSLA outperforms the baselines by over 5.72\%.
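The selection step described in the abstract can be sketched as a standard 0/1 knapsack dynamic program over candidate schema elements. The element names, scores, weights, and capacity below are illustrative assumptions for a toy example, not values or code from the paper; KaSLA derives values, weights, and capacity from its importance score function and historical SQL records.

```python
# Hypothetical sketch: picking a schema-element subset via 0/1 knapsack DP.
# Assumes integer weights and capacity; real-valued scores would need to be
# discretized before this DP applies.

def knapsack_select(elements, values, weights, capacity):
    """Return the subset of schema elements maximizing total value
    subject to total weight <= capacity (classic 0/1 knapsack DP)."""
    n = len(elements)
    # dp[i][w] = best achievable value using the first i elements
    # with a remaining weight budget of w
    dp = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(capacity + 1):
            dp[i][w] = dp[i - 1][w]  # option 1: skip element i-1
            if weights[i - 1] <= w:  # option 2: take element i-1 if it fits
                take = dp[i - 1][w - weights[i - 1]] + values[i - 1]
                if take > dp[i][w]:
                    dp[i][w] = take
    # Backtrack through the table to recover the chosen subset
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if dp[i][w] != dp[i - 1][w]:
            chosen.append(elements[i - 1])
            w -= weights[i - 1]
    return list(reversed(chosen))

# Toy example: four candidate columns with illustrative (value, weight)
# pairs and a capacity of 3.
cols = ["user.id", "user.name", "order.total", "order.note"]
vals = [5, 3, 4, 1]
wts = [1, 2, 2, 2]
print(knapsack_select(cols, vals, wts, 3))  # -> ['user.id', 'order.total']
```

The DP runs in O(n * capacity) time, which is what makes an exact knapsack solution practical here: the number of candidate tables and columns per query is small.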
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4212