Octopus: An Auto-Generated Multidimensional Fine-Grained Benchmark for Evaluating Text-to-SQL Systems

ICLR 2026 Conference Submission 25382 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Text-to-SQL, Benchmark, Large Language Model
Abstract: Text-to-SQL converts natural language queries into SQL queries, letting users interact with databases without knowing SQL. Advances in large language models (LLMs) have significantly accelerated text-to-SQL development, which makes an appropriate benchmark for evaluating text-to-SQL models important. However, existing text-to-SQL benchmarks are mainly produced by human annotation and suffer from low SQL complexity, a single questioning mode, and low scalability. To address these limitations, we present OCTOPUS, a new multidimensional text-to-SQL benchmark with comprehensive evaluation metrics and fully auto-generated datasets. OCTOPUS defines 9 first-level metrics and 18 second-level metrics across four dimensions of text-to-SQL system performance: accuracy, robustness, interactivity, and generalization. To support benchmark construction, we also propose a series of fully automatic text-to-SQL data generation methods, which reduce human involvement, improve efficiency, and support higher scalability. OCTOPUS consists of 10,885 complex question-SQL pairs and 10,874 multi-turn dialogues over 74 public databases. We evaluate state-of-the-art text-to-SQL models on OCTOPUS and find that they perform unsatisfactorily on all metrics and remain far from practical application. OCTOPUS can be used to improve the accuracy and utility of text-to-SQL models.
Primary Area: datasets and benchmarks
Submission Number: 25382