SpeechQC-Agent: A Natural Language Driven Multi-Agent System for Speech Dataset Quality

ICLR 2026 Conference Submission16482 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: low resource, multi agent, large language model, automatic speech recognition
TL;DR: SpeechQC-Agent is a natural language-driven multi-agent system that transforms user instructions into structured validation workflows for scalable and automated speech dataset quality checks.
Abstract: Ensuring the quality of large-scale datasets is a prerequisite for reliable machine learning, yet current verification pipelines are static, domain-specific, and heavily reliant on human experts. We introduce **SpeechQC-Agent**, the first natural language-driven agentic framework for dataset quality control that generalizes across modalities, vendors, and languages. A central planner LLM decomposes user queries into directed acyclic graph (DAG) workflows executed by modular sub-agents that combine reusable tools with LLM-synthesized functions, enabling flexible and scalable verification. Unlike rule-based scripts, this design supports parallelism, dependency management, and adaptive extension to novel schemas. To benchmark verification systems, we release **SpeechQC-Dataset**, a multilingual speech corpus with controlled perturbations spanning audio, transcripts, and metadata, allowing systematic evaluation of 24 verification tasks. Experiments show that SpeechQC-Agent achieves 80-90% of expert-level accuracy while operating at less than 20% of the cost and time, and that it generalizes from synthetic perturbations to real vendor-supplied corpora. Comparative analysis across multiple planner LLMs highlights trade-offs between fidelity (GPT-4.1-mini), efficiency (LLaMA-3.3-70B), and reasoning strength (DeepSeek-R1). Beyond speech, our approach establishes a general paradigm for LLM-driven workflow generation in dataset quality assurance, with implications for the curation of multimodal and multilingual resources at scale.
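The DAG-workflow idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the task names (`load_manifest`, `check_duration`, `check_transcript`, `summarize`), the toy records, and the `run_workflow` helper are all hypothetical, and the workflow is hard-coded here, whereas in the paper a planner LLM produces it. The sketch only shows how a dependency graph of checks might be executed, with independent checks running in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical verification steps (illustrative only, not from the paper).
# Each task receives a dict holding the outputs of already-finished tasks.
def load_manifest(results):
    return [
        {"id": "utt1", "duration": 2.5, "text": "hello"},
        {"id": "utt2", "duration": 0.0, "text": ""},
    ]

def check_duration(results):
    # Flag utterances with a non-positive duration.
    return [r["id"] for r in results["load_manifest"] if r["duration"] <= 0]

def check_transcript(results):
    # Flag utterances with an empty transcript.
    return [r["id"] for r in results["load_manifest"] if not r["text"].strip()]

def summarize(results):
    # Merge the flagged IDs from both checks.
    return sorted(set(results["check_duration"]) | set(results["check_transcript"]))

TASKS = {
    "load_manifest": load_manifest,
    "check_duration": check_duration,
    "check_transcript": check_transcript,
    "summarize": summarize,
}

# Maps each task to the set of tasks it depends on (assumed plan format).
DEPS = {
    "load_manifest": set(),
    "check_duration": {"load_manifest"},
    "check_transcript": {"load_manifest"},
    "summarize": {"check_duration", "check_transcript"},
}

def run_workflow(tasks, deps, max_workers=4):
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All tasks whose dependencies are finished run as one parallel batch.
            ready = sorter.get_ready()
            snapshot = dict(results)  # read-only view handed to worker threads
            futures = {name: pool.submit(tasks[name], snapshot) for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results

results = run_workflow(TASKS, DEPS)
print(results["summarize"])  # → ['utt2']
```

Here `check_duration` and `check_transcript` depend only on `load_manifest`, so they are scheduled in the same batch, while `summarize` waits for both. This batch-by-batch scheduling is a simplification chosen for brevity; a production executor might instead dispatch each task as soon as its own dependencies finish.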
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16482