Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary prosody detection module, by introducing a joint ASR and prosody detection model. The prosody detection component of our model achieves a significant improvement over the state of the art for the task, closing the gap in F1-score by 41%. Additionally, joint training decreases the ASR word error rate (WER) by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or relearn important prosodic cues.
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2557