RESPIN-S1.0: A read speech corpus of 10000+ hours in dialects of nine Indian Languages

Published: 18 Sept 2025, Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track poster · CC BY 4.0
Keywords: Automatic Speech Recognition, ASR Corpus, Indian Languages
TL;DR: We present RESPIN-S1.0, a large-scale, dialect-rich corpus of over 10,000 hours of read speech across nine major Indian languages: Bengali, Bhojpuri, Chhattisgarhi, Hindi, Kannada, Magahi, Maithili, Marathi, and Telugu.
Abstract: We introduce **RESPIN-S1.0**, the largest publicly available dialect-rich read-speech corpus for Indian languages, comprising more than 10,000 hours of validated audio across nine major languages: Bengali, Bhojpuri, Chhattisgarhi, Hindi, Kannada, Magahi, Maithili, Marathi, and Telugu. Indian languages exhibit high dialectal variation and are spoken by populations that remain digitally underserved. Existing speech corpora typically represent only standard dialects and lack domain and linguistic diversity. RESPIN-S1.0 addresses this limitation by collecting speech across more than 38 dialects and two high-impact domains: agriculture and finance. Text data were composed by native dialect speakers and validated through a pipeline combining automated and manual checks. Over 200,000 unique sentences were recorded through a crowdsourced mobile platform and categorised into clean, semi-noisy, and noisy subsets based on transcription quality, with the clean portion alone exceeding 10,000 hours. Along with audio and transcriptions, RESPIN provides dialect-aware phonetic lexicons, speaker metadata, and reproducible train, development, and test splits. To benchmark performance, we evaluate multiple ASR models, including TDNN-HMM, E-Branchformer, Whisper, and wav2vec2-based self-supervised models, and find that fine-tuning on RESPIN significantly improves recognition accuracy over pretrained baselines. A subset of RESPIN-S1.0 has already supported community challenges such as the SLT Code Hackathon 2022 and MADASR@ASRU 2023 and 2025, with more than 1,200 hours released publicly. This resource supports research in dialectal ASR, language identification, and related speech technologies, establishing a comprehensive benchmark for inclusive, dialect-rich ASR in multilingual low-resource settings.
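
As a rough illustration of the pretrained-baseline evaluation described in the abstract, the sketch below scores an off-the-shelf Whisper model against reference transcripts using word error rate. This is not the authors' baseline recipe: the manifest filename and its tab-separated `<audio_path>\t<transcript>` format are assumptions for illustration only; the actual data layout and recipes are documented in the code repository linked below.

```python
# Minimal sketch (assumed data layout, not the official baseline code):
# transcribe a few RESPIN-S1.0 utterances with a pretrained Whisper model
# and compute word error rate against the reference transcripts.
from transformers import pipeline
from jiwer import wer

# Hypothetical tab-separated manifest: <audio_path>\t<reference_transcript>
MANIFEST = "respin_hindi_clean_test.tsv"  # assumed filename and format

# Pretrained baseline with no RESPIN fine-tuning.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

references, hypotheses = [], []
with open(MANIFEST, encoding="utf-8") as f:
    for line in f:
        audio_path, reference = line.rstrip("\n").split("\t", maxsplit=1)
        result = asr(audio_path)  # the pipeline handles loading and resampling
        references.append(reference)
        hypotheses.append(result["text"])

print(f"WER on {len(references)} utterances: {wer(references, hypotheses):.2%}")
```

The same scoring loop can be reused after fine-tuning by pointing the pipeline at a locally fine-tuned checkpoint instead of the pretrained model name.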
Croissant File: json
Dataset URL: https://spiredatasets.ee.iisc.ac.in/respincorpus
Code URL: https://github.com/labspire/respin_baselines.git
Primary Area: Applications of Datasets & Benchmarks in speech and audio
Flagged For Ethics Review: true
Submission Number: 2375