Octopus: An Auto-Generated Multidimensional Fine-Grained Benchmark for Evaluating Text-to-SQL Systems

Xiang Li; Yu Haining; Chuanyi Liu; Xiao Sun; yikai hou; Peiyi Han; Shaoming Duan; Yizheng Yang; huangwenjie; Wenting Zhang; Zhichao Liu; yiqing zhang; Sun Yinggang; Ziming Guo; Dongyang Zhan; Hongli Zhang; Liang Yan; yingwei liang; Xiaohua Jia

20 Sept 2025 (modified: 14 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Text-to-SQL, Benchmark, Large Language Model

Abstract: Text-to-SQL converts natural language questions into structured SQL queries, letting users interact with databases without any SQL knowledge. The advent of LLMs has significantly accelerated text-to-SQL development. Constructing an appropriate benchmark is important for evaluating the performance of text-to-SQL models. However, existing text-to-SQL benchmarks are mainly produced by human annotation and suffer from low SQL complexity, a single questioning mode, and low scalability. To address these limitations, we present OCTOPUS, a new multidimensional text-to-SQL benchmark with comprehensive evaluation metrics and fully auto-generated datasets. OCTOPUS defines 9 first-level metrics and 18 second-level metrics across four dimensions of text-to-SQL system performance: accuracy, robustness, interactivity, and generalization. To support benchmark construction, we also propose a series of fully automatic text-to-SQL data generation methods, which reduce human involvement, improve efficiency, and support higher scalability. OCTOPUS consists of 10,885 complex question-SQL pairs and 10,874 multi-turn dialogues over 74 public databases. We evaluate state-of-the-art text-to-SQL models on OCTOPUS and find that they perform unsatisfactorily on all testing metrics and are still far from practical application. OCTOPUS can be used to improve the accuracy and utility of text-to-SQL models.

Primary Area: datasets and benchmarks

Submission Number: 25382
