KrishiVaani: A Conversational Hindi Speech Corpus through Automatic ASR Post-Correction and Accelerated Refinement

ACL ARR 2025 May Submission3984 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Building robust, real-world Automatic Speech Recognition (ASR) datasets for Hindi remains challenging due to linguistic complexity, accent variation, and domain-specific vocabulary, particularly in agricultural contexts. This paper presents KrishiVaani, a conversational Hindi speech corpus that addresses these challenges. We also provide a comparative analysis of various language models (LMs) and large language models (LLMs) for automatic ASR post-correction, which aids in selecting a model that minimizes annotation effort. In addition, we extend an open-source tool, VAgyojaka, to accelerate data validation and verification by 6x and 2x, respectively. This enhancement streamlines the creation of large-scale Hindi speech corpora and supports high-quality data through efficient annotation and error detection. Experimental results show that using KrishiVaani significantly improves ASR accuracy across diverse speaker accents, environmental noise levels, and agricultural terminology.
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, large language model, post-correction
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Hindi
Submission Number: 3984