Is Synthetic Data Sufficient for Extractive Spoken Question Answering?

ACL ARR 2024 June Submission2384 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Spoken language understanding is essential for extracting meaning from spoken language, particularly in low- or zero-resource settings that rely on speech in the absence of text data. This work investigates the effectiveness of using synthetic speech data in Spoken Question Answering (SQA). By manipulating prosody in human-read test sets and proposing a new SQA dataset for fine-tuning, we demonstrate that models trained solely on synthetic speech can utilise prosodic cues. Moreover, models fine-tuned on synthetic speech outperform those fine-tuned on natural speech, even with the same or restricted lexical information. Our findings suggest that current text-to-speech systems can simulate sufficient prosody for SQA models, and that the contribution of natural prosody is limited within the current textless SQA framework.
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: spoken language understanding, QA via spoken queries
Contribution Types: Model analysis & interpretability, Data analysis, Position papers
Languages Studied: English
Submission Number: 2384