The Role of Synthetic Data in Multilingual, Multicultural AI Systems: Lessons from Indic languages

17 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Synthetic Data, Multilingual & Multicultural NLP, Instruction Fine-Tuning, Indian Languages
TL;DR: We present Updesh, a synthetic instruction-tuning dataset for Indian languages built via a culturally grounded, bottom-up approach. Evaluations show limits of LLM-as-judge and highlight both the promise and constraints of synthetic multilingual data.
Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs ($\geq 235$B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce $Updesh$, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, covering diverse reasoning and generative tasks with an emphasis on enhancing long-context and multi-turn capabilities, in addition to improving alignment with Indian cultural contexts. A comprehensive evaluation combining automated metrics with 10k human annotations indicates that the generated data is of high quality; however, human evaluation highlights specific areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing their performance across 15 diverse multilingual datasets to evaluate its generalizability. Our experiments demonstrate that models trained on $Updesh$ consistently achieve significant performance improvements on generative tasks and remain competitive on multiple-choice NLU evaluations. Further, the relative improvements of models fine-tuned on $Updesh$ are most pronounced in low- and medium-resource languages, effectively narrowing the gap between these languages and high-resource ones.
These findings provide empirical evidence that effective multilingual AI development requires multi-faceted data curation and generation strategies that include context-aware, culturally grounded methodologies.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8977