VāṇīSetu: A Human-AI Collaborative Framework for Scalable Conversational Speech Corpus Creation in Low-Resource Settings
Keywords: Speech corpus, Human–AI collaboration, Low-resource, Domain adaptation
Abstract: We present **VāṇīSetu**, a human-AI collaborative framework for constructing high-quality speech corpora in low-resource languages. As a case study, we apply VāṇīSetu to create **KrishiVāṇī**, a 100-hour Hindi dataset of **unscripted**, **noisy**, and **code-mixed** agricultural speech mined from YouTube. VāṇīSetu integrates automatic speech recognition (ASR), lightweight and large language model-based post-correction, and structured, multi-stage human validation implemented through an enhanced annotation tool, **Vāgyojaka**. Experiments show that domain-specific fine-tuning improves ASR accuracy on real-world agricultural speech, and that small language models such as mT5 provide low-latency corrections that reduce annotation effort by **61\%** while preserving transcript fidelity. By shifting annotators from manual transcribers to informed validators, VāṇīSetu enables scalable and linguistically rich corpus creation, and highlights practical cost-quality-latency trade-offs in integrating LMs/LLMs into human-in-the-loop dataset development.
Paper Type: Short
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, spoken language technologies
Contribution Types: Approaches to low-resource settings
Languages Studied: Hindi, Marathi, Telugu
Submission Number: 8011