MILA (MULTILINGUAL INDIC LANGUAGE ARCHIVE): A DATASET FOR EQUITABLE MULTILINGUAL LLMS

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models (LLMs), Multilingual NLP, Low-resource languages, Indic languages, Data curation, OCR for Indic scripts, Synthetic data generation, Translation pipelines, Data distillation, Benchmarking (Indic-MMLU), Domain-specific corpora, Continual pretraining, Equitable multilingual modeling, Inclusive language technologies
Abstract: Large Language Models (LLMs) are structurally biased toward high-resource languages such as English due to corpus skew, a problem that is particularly severe for Indic languages. To address this deficit, we introduce ***MILA***, the largest expert-curated Indic corpus to date, comprising ***7.5 trillion tokens*** across ***16 scheduled Indic languages*** and English. MILA is constructed via a multi-stage data engineering pipeline that integrates large-scale ***web acquisition***, script-sensitive ***OCR*** for under-digitized Indic writing systems, LLM-assisted post-correction for ***translation*** fidelity, and ***targeted data distillation*** through the ***Indic-Persona Hub***. The pipeline further incorporates ***synthetic augmentation and rewriting***, followed by stringent ***quality, toxicity, language, and deduplication filtering***, and culminates in ***human-in-the-loop linguistic*** and cultural validation with comprehensive ***PII redaction*** and ***downstream task- and benchmark-based decontamination***. This pipeline yields a distributionally stable, contamination-controlled, high-fidelity pretraining substrate. Alongside the corpus, we release ***Indic-MMLU***, a translated and verified adaptation of MMLU into 16 Indian languages, offering the first large-scale Indic multilingual benchmark for assessing LLMs and the extent of their cross-lingual knowledge transfer. We further propose a ***parity-based fairness metric*** that captures cross-lingual performance asymmetries relative to English. Comprehensive experiments, including controlled ablations of translation quality, OCR incorporation, synthetic SFT generation, and continual pretraining, demonstrate that models trained on MILA achieve substantial gains on Indic-MMLU and materially narrow cross-lingual disparities. Collectively, MILA, Indic-MMLU, and the associated validation protocols establish a scalable foundation for equitable multilingual modeling in the Indic context. All resources are released anonymously for reproducibility.
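The abstract does not spell out the parity metric's formula; as a hedged illustration only (our assumption, not the authors' stated definition), a per-language parity score on a benchmark such as Indic-MMLU could normalize each language's accuracy by the English accuracy, with an aggregate gap summarizing the overall asymmetry:

```latex
% Hedged sketch, not the paper's stated definition.
% A_\ell : benchmark accuracy in language \ell
% A_{en} : benchmark accuracy in English
% \mathcal{L} : the set of evaluated Indic languages
\mathrm{Parity}(\ell) = \frac{A_\ell}{A_{\mathrm{en}}},
\qquad
\mathrm{Gap}(\mathcal{L}) = 1 - \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \mathrm{Parity}(\ell)
```

Under this reading, Parity(ℓ) = 1 means language ℓ matches English performance, and Gap(𝓛) approaches 0 as cross-lingual disparities close.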
Primary Area: datasets and benchmarks
Submission Number: 24079