IndicDB - Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Aviral Dawar; Roshan Karanth; Vikram Goyal; Dhruv Kumar

Published: 01 Jun 2026, Last Modified: 01 Jun 2026. Venue: Culture x AI 2026 (Poster). License: CC BY 4.0

Keywords: Text-to-SQL, Multilingual NLP, Indic Languages, LLM Agents, Benchmark, Indian Administrative Data, Natural Language Interfaces, Database Reasoning, Synthetic Data Generation, Cross-Lingual Evaluation

TL;DR: IndicDB is a large-scale multilingual Text-to-SQL benchmark derived from Indian administrative data and built with a three-agent pipeline; it reveals a persistent 9.00% performance gap between English and Indic languages.

Abstract: While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a critical gap in real-world, non-Western applications. We present IndicDB, a comprehensive multilingual Text-to-SQL benchmark designed to evaluate cross-lingual semantic parsing across diverse Indic language families. The foundational relational schemas for IndicDB are sourced from primary open-data platforms, specifically the National Data and Analytics Platform (NDAP, https://ndap.niti.gov.in/) and the India Data Portal (IDP, https://indiadataportal.com/), ensuring the benchmark reflects the structural complexity of real-world administrative data. IndicDB comprises 237 tables across 20 databases. To transform denormalized government data into complex relational structures, we employ an iterative three-agent judge pattern (Architect, Auditor, and Refiner) that enforces structural rigor and high relational density, with multiple tables per database and join depths of up to six. Task synthesis follows a value-aware, difficulty-calibrated, and join-enforced pipeline that systematically generates tasks in English, Hindi, and five additional Indic languages. We evaluate the cross-lingual semantic parsing performance of state-of-the-art models, including DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, and Qwen3, across all seven linguistic variants to establish comprehensive baselines. Our results reveal a consistent performance drop from English to the Indic variants across models, highlighting a persistent "Indic Gap" driven by increased schema-linking difficulty, greater structural ambiguity in mapping Indic languages to SQL, and a lack of external knowledge. IndicDB serves as a rigorous "pressure test" for cross-lingual Text-to-SQL synthesis and semantic parsing in linguistically diverse environments.
The code and benchmark are publicly available at: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/.
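
As a rough illustration of the iterative Architect, Auditor, and Refiner pattern described in the abstract, the following minimal Python sketch runs a propose, audit, and refine loop over a schema. Everything in it is an assumption made for illustration: the `Schema` type, the `design_schema` function, the `max_rounds` parameter, and the toy agents (plain functions standing in for LLM-backed agents) are hypothetical and are not taken from the IndicDB code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Schema:
    """Hypothetical schema container: table name -> column names, plus FK pairs."""
    tables: Dict[str, List[str]]
    foreign_keys: List[Tuple[str, str]] = field(default_factory=list)


# Hypothetical agent signatures; in practice each would wrap an LLM call.
Architect = Callable[[List[str]], Schema]
Auditor = Callable[[Schema], List[str]]
Refiner = Callable[[Schema, List[str]], Schema]


def design_schema(columns: List[str], architect: Architect, auditor: Auditor,
                  refiner: Refiner, max_rounds: int = 3) -> Schema:
    """Propose a schema, then audit and refine until no issues remain."""
    schema = architect(columns)
    for _ in range(max_rounds):
        issues = auditor(schema)
        if not issues:  # the auditor accepts the schema
            break
        schema = refiner(schema, issues)
    return schema


def toy_architect(columns: List[str]) -> Schema:
    # Start from the denormalized layout: one wide table.
    return Schema(tables={"records": list(columns)})


def toy_auditor(schema: Schema) -> List[str]:
    # Flag a repeated descriptive column that should live in its own table.
    if "state_name" in schema.tables.get("records", []):
        return ["state_name is repeated per row; move it to its own table"]
    return []


def toy_refiner(schema: Schema, issues: List[str]) -> Schema:
    # Split state_name into a 'states' table and link it with a foreign key.
    rest = [c for c in schema.tables["records"] if c != "state_name"]
    return Schema(
        tables={"states": ["state_id", "state_name"],
                "records": ["state_id"] + rest},
        foreign_keys=schema.foreign_keys
        + [("records.state_id", "states.state_id")],
    )


final = design_schema(["state_name", "year", "rainfall_mm"],
                      toy_architect, toy_auditor, toy_refiner)
print(final.tables)
print(final.foreign_keys)
```

In this toy run the auditor raises one issue, the refiner splits the wide table in two, and the second audit passes, so the loop stops before reaching `max_rounds`. The cap on rounds is a design choice assumed here to keep the loop from running indefinitely if the auditor and refiner never converge.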

Submission Number: 19
