Keywords: LLM, Prompt Engineering, NL2SQL Evaluation
Abstract: Current NL2SQL evaluation relies heavily on execution accuracy (EX), which measures correctness by comparing query results against ground truth at the string level. While effective for traditional supervised models that produce uniform outputs, this metric proves inadequate in the LLM era, where diverse yet semantically equivalent SQL queries can correctly answer the same natural language question. To address this limitation, we investigate LLM-based evaluation for NL2SQL tasks and propose a rule-generation-enhanced framework.
It leverages a training dataset with annotated correctness labels through a three-step learning process: data clustering, intra-cluster rule summarization and refinement, and inter-cluster rule aggregation. In this way, the model learns from labeled data through evaluation rule synthesis rather than parameter updates. The generated rules are integrated into LLM evaluation prompts during testing. We conduct experiments across three datasets, covering three evaluation scenarios: (1) identifying semantically correct predictions whose execution results differ from the reference SQL, (2) distinguishing functionally different SQL queries that produce identical execution results, and (3) evaluating generated SQL correctness in the absence of reference queries. Our results demonstrate that traditional EX metrics show poor alignment with human annotations, while LLMs exhibit strong potential for this evaluation task. Our rule-generation framework consistently enhances LLMs' performance across all datasets and model variants. It effectively learns dataset-specific evaluation rules, and these learned rules can be successfully transferred to smaller models to improve their evaluation capabilities.
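A minimal sketch of what such a three-step pipeline (clustering, intra-cluster rule summarization, inter-cluster aggregation, then prompt injection at test time) could look like. This is an illustrative assumption, not the authors' implementation: the TF-IDF/k-means clustering choice, the prompts, and the `call_llm` hook are all hypothetical placeholders.

```python
# Hypothetical sketch of the abstract's three-step rule-generation pipeline.
# Clustering method, prompts, and the `call_llm` hook are illustrative assumptions.
from collections import defaultdict
from typing import Callable, Dict, List

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_examples(examples: List[Dict], n_clusters: int = 5) -> Dict[int, List[Dict]]:
    """Step 1: group labeled (question, predicted SQL, label) examples by surface similarity."""
    texts = [ex["question"] + " " + ex["pred_sql"] for ex in examples]
    vectors = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vectors)
    clusters = defaultdict(list)
    for ex, c in zip(examples, labels):
        clusters[c].append(ex)
    return clusters


def summarize_cluster_rules(cluster: List[Dict], call_llm: Callable[[str], str]) -> str:
    """Step 2: ask an LLM to summarize evaluation rules that explain one cluster's labels."""
    shots = "\n".join(
        f"Q: {ex['question']}\nPred: {ex['pred_sql']}\nRef: {ex['ref_sql']}\nLabel: {ex['label']}"
        for ex in cluster
    )
    prompt = (
        "Given these labeled NL2SQL evaluation examples, write concise rules "
        "describing when a predicted SQL query should be judged correct:\n" + shots
    )
    return call_llm(prompt)


def aggregate_rules(cluster_rules: List[str], call_llm: Callable[[str], str]) -> str:
    """Step 3: merge per-cluster rules into one deduplicated global rule set."""
    prompt = (
        "Merge the following rule lists into a single coherent, non-redundant set "
        "of NL2SQL evaluation rules:\n" + "\n---\n".join(cluster_rules)
    )
    return call_llm(prompt)


def build_eval_prompt(rules: str, question: str, pred_sql: str, ref_sql: str) -> str:
    """Test time: inject the learned rules into the LLM evaluation prompt."""
    return (
        f"Evaluation rules:\n{rules}\n\n"
        f"Question: {question}\nPredicted SQL: {pred_sql}\nReference SQL: {ref_sql}\n"
        "Is the predicted SQL a correct answer? Reply 'correct' or 'incorrect'."
    )
```

Under these assumptions, the learned rule text (not model weights) carries the dataset-specific knowledge, which is why the rules can be handed to a smaller evaluator model unchanged.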
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19134