Schema-Refiner: Synergizing Knowledge Graphs and LLMs for Proactive Schema Refinement in Text-to-SQL

Jiaqian Wang; Yutao Qi; Wenjin Hou; Yu Pang; Rui Yang

18 Sept 2025 (modified: 12 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Text-to-SQL, Schema Ambiguity, Schema Refinement, Neuro-Symbolic Methods, Knowledge Graph

TL;DR: We introduce a framework that automatically cleans up and standardizes ambiguous database schemas, making Text-to-SQL systems more robust and accurate.

Abstract: Text-to-SQL broadens data access but often underperforms in real-world databases. We trace a key cause to lexical–schema ambiguity, which manifests as homonymy, synonymy, and irregular naming that obscure the mapping between user utterances and schema elements, leading models astray. Prior work primarily deploys interactive, downstream fixes (e.g., robust decoding or clarification), which do not resolve the schema’s intrinsic ambiguity. To mitigate this challenge, we propose Schema-Refiner, a neuro-symbolic framework that (i) builds a schema knowledge graph, (ii) applies community detection to recover column-level context, (iii) uses a large language model to infer canonical semantics, and (iv) synthesizes CREATE VIEW statements, exposing a standardized, disambiguated logical schema layer for direct consumption by downstream Text-to-SQL models. This layer leaves the database untouched; a rule-based rewriter maps queries over the views into equivalent SQL on the original schema. To evaluate robustness to lexical–schema ambiguity and the effectiveness of our approach, we construct Amb-Spider by injecting ambiguities into Spider with human-verified annotations. Across multiple state-of-the-art Text-to-SQL systems, Amb-Spider consistently reduces execution accuracy. When paired with Schema-Refiner, these systems better detect ambiguities and regain a large share of the lost accuracy.
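
The final step of the pipeline, synthesizing CREATE VIEW statements over an untouched database, might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the table `emp`, its columns, the view name `employee`, the hand-written `canonical` mapping (standing in for what the language model would infer), and the helpers `quote_ident` and `build_view_sql` are all hypothetical. SQLite is used purely because it ships with Python.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_view_sql(table: str, view: str, canonical: dict[str, str]) -> str:
    """Return a CREATE VIEW statement that renames each column of `table`.

    `canonical` maps original column names to disambiguated names.
    """
    cols = ", ".join(
        f"{quote_ident(old)} AS {quote_ident(new)}"
        for old, new in canonical.items()
    )
    return (
        f"CREATE VIEW {quote_ident(view)} AS "
        f"SELECT {cols} FROM {quote_ident(table)}"
    )


# Hypothetical base table with irregular, ambiguous column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (nm TEXT, dt TEXT, sal REAL)")
conn.execute("INSERT INTO emp VALUES ('Ada', '2020-01-15', 5000.0)")

# Assumed output of the semantic-inference step (hard-coded here).
canonical = {"nm": "employee_name", "dt": "hire_date", "sal": "monthly_salary"}

view_sql = build_view_sql("emp", "employee", canonical)
conn.execute(view_sql)

# A downstream model would query the view with the clearer names.
rows = conn.execute("SELECT employee_name, hire_date FROM employee").fetchall()

# The base table keeps its original columns; only a view was added.
base_columns = [r[1] for r in conn.execute("PRAGMA table_info(emp)")]
```

Because the view only renames columns, a query over `employee` can be mapped back to `emp` by substituting names, which is the kind of equivalence the abstract's rule-based rewriter relies on; the actual rewriting rules are not described in this section.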

Supplementary Material: zip

Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)

Submission Number: 10632
