Abstract: The lack of large-scale pretraining data for low-resource languages of the Indian subcontinent leads to their underrepresentation in existing massively multilingual models. In this work, we address this gap by proposing a framework for creating large raw-audio datasets for such underrepresented languages by collating publicly accessible audio content. Leveraging this framework, we present MahaDhwani, a corpus comprising 279K hours of raw audio across 22 Indian languages. To test the utility of MahaDhwani, we pretrain a Conformer-style model and then fine-tune it to build a multilingual ASR model supporting the 22 languages. Using a hybrid multi-softmax decoder, we balance the benefit of shared parameters, which enable cross-lingual transfer, against the benefit of dedicated capacity for each language. Our evaluations on the IndicVoices benchmark show the benefits of pretraining, particularly in low-resource settings. We will open-source our framework, code, and scripts to reproduce the dataset.
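To make the decoder design concrete, below is a minimal PyTorch sketch of one plausible reading of a hybrid multi-softmax decoder: a shared projection over the encoder output (the parameters that enable cross-lingual transfer) followed by a dedicated softmax head per language (the dedicated capacity). All class names, dimensions, and vocabulary sizes here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiSoftmaxDecoder(nn.Module):
    """Hypothetical hybrid decoder: shared projection + per-language softmax heads."""

    def __init__(self, encoder_dim: int, vocab_sizes: dict[str, int]):
        super().__init__()
        # Shared parameters: one projection used by every language,
        # allowing cross-lingual transfer through the common representation.
        self.shared_proj = nn.Linear(encoder_dim, encoder_dim)
        # Dedicated capacity: one output (softmax) layer per language,
        # each with its own vocabulary size.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(encoder_dim, v) for lang, v in vocab_sizes.items()}
        )

    def forward(self, enc_out: torch.Tensor, lang: str) -> torch.Tensor:
        # enc_out: (batch, time, encoder_dim), e.g. from a Conformer encoder.
        hidden = torch.relu(self.shared_proj(enc_out))
        # Log-probabilities over the selected language's vocabulary,
        # suitable as input to a CTC-style loss.
        return self.heads[lang](hidden).log_softmax(dim=-1)


# Usage with illustrative language codes and dummy encoder output:
decoder = MultiSoftmaxDecoder(encoder_dim=256, vocab_sizes={"hi": 128, "ta": 96})
enc_out = torch.randn(4, 50, 256)         # (batch=4, time=50, encoder_dim=256)
log_probs = decoder(enc_out, lang="hi")   # shape: (4, 50, 128)
```

During multilingual fine-tuning, each batch would route through the head matching its language while gradients for the shared projection accumulate across all languages; this is the trade-off the abstract describes.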