VāṇīSetu: A Human-AI Collaborative Framework for Scalable Conversational Speech Corpus Creation in Low-Resource Settings

ACL ARR 2026 January Submission8011 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0

Keywords: Speech corpus, Human–AI collaboration, Low-resource, Domain adaptation

Abstract: We present **VāṇīSetu**, a human-AI collaborative framework for constructing high-quality speech corpora in low-resource languages. As a case study, we apply VāṇīSetu to create **KrishiVāṇī**, a 100-hour Hindi dataset of **unscripted**, **noisy**, and **code-mixed** agricultural speech mined from YouTube. VāṇīSetu integrates automatic speech recognition (ASR), post-correction with both lightweight and large language models, and structured, multi-stage human validation implemented through an enhanced annotation tool, **Vāgyojaka**. Experiments show that domain-specific fine-tuning improves ASR accuracy on real-world agricultural speech, and that small language models such as mT5 provide low-latency corrections that reduce annotation effort by **61%** while preserving transcript fidelity. By shifting annotators from manual transcribers to informed validators, VāṇīSetu enables scalable and linguistically rich corpus creation, and highlights practical cost-quality-latency trade-offs in integrating LMs/LLMs into human-in-the-loop dataset development.

Paper Type: Short

Research Area: Speech Processing and Spoken Language Understanding

Research Area Keywords: automatic speech recognition, spoken language technologies

Contribution Types: Approaches to low-resource settings

Languages Studied: Hindi, Marathi, Telugu

Submission Number: 8011
