Abstract: We introduce \textbf{SpeechQC-Agent}, a natural language–driven, multi-agent framework for automated verification of large-scale, multilingual speech-text datasets. Our system leverages a central Large Language Model (LLM) to interpret user-specified verification prompts and orchestrate a set of specialized agents that perform audio, transcript, and metadata quality checks. Each prompt is translated into a structured, dependency-aware workflow graph, executed through a combination of dynamically generated and pre-defined tools. To support evaluation, we release \textbf{SpeechQC-Dataset}, a synthetic yet realistic benchmark covering 15.5 hours of Hindi dialogue across diverse speakers, domains, and error types. Experiments across two verification stages, QC1 (audio and metadata) and QC2 (transcript and content), show that ChatGPT-based agents outperform open-weight LLMs in planning accuracy and execution robustness. We further adapt recent agentic evaluation protocols to measure workflow fidelity via subsequence and subgraph metrics. Our framework enables scalable, reproducible, and instruction-driven speech dataset verification, laying the foundation for high-quality speech corpus creation in low-resource settings.
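To make the "dependency-aware workflow graph" idea concrete, below is a minimal sketch (not the authors' code) of how a verification prompt could be compiled into a set of QC steps and executed in dependency order. All names here (`QCStep`, `run_workflow`, and the toy tool functions) are hypothetical illustrations, not part of the released framework.

```python
# Minimal sketch of a dependency-aware QC workflow graph, assuming each
# step is a named tool with an explicit list of prerequisite steps.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # stdlib, Python 3.9+
from typing import Callable

@dataclass
class QCStep:
    name: str                               # e.g. "check_sample_rate"
    tool: Callable                          # pre-defined or dynamically generated tool
    depends_on: list[str] = field(default_factory=list)

def run_workflow(steps: list[QCStep], dataset_path: str) -> dict[str, object]:
    """Execute QC steps in an order that respects their dependencies."""
    graph = {s.name: set(s.depends_on) for s in steps}   # node -> predecessors
    by_name = {s.name: s for s in steps}
    results: dict[str, object] = {}
    for name in TopologicalSorter(graph).static_order():
        # Each tool sees the dataset path and all upstream results.
        results[name] = by_name[name].tool(dataset_path, results)
    return results

# Toy QC1-style pass (audio and metadata checks) on a hypothetical path:
steps = [
    QCStep("load_metadata", lambda p, r: {"rows": 120}),
    QCStep("check_sample_rate", lambda p, r: "16 kHz OK", ["load_metadata"]),
    QCStep("check_clipping", lambda p, r: "no clipping", ["load_metadata"]),
]
print(run_workflow(steps, "speechqc_dataset/"))
```

In a full system, the central LLM would emit the step list and dependency edges from the user's natural-language prompt; the topological execution shown here is one straightforward way to honor those dependencies.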
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Automatic speech recognition, low resource, agent, large language model
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: Hindi
Submission Number: 7787