Self-Supervised Speech Recognition via Local Prior Matching

Wei-Ning Hsu; Ann Lee; Gabriel Synnaeve; Awni Hannun

Self-Supervised Speech Recognition via Local Prior Matching

Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun

25 Sept 2019 (modified: 05 May 2023)ICLR 2020 Conference Blind SubmissionReaders: Everyone

TL;DR: on-the-fly soft pseudo-labeling with LM weighting is better than [off-line hard pseudo-labeling | alternatives] for semi-supervised speech recognition

Abstract: We propose local prior matching (LPM), a self-supervised objective for speech recognition. The LPM objective leverages a strong language model to provide learning signal given unlabeled speech. Since LPM uses a language model, it can take advantage of vast quantities of both unpaired text and speech. The loss is theoretically well-motivated and simple to implement. More importantly, LPM is effective. Starting from a model trained on 100 hours of labeled speech, with an additional 360 hours of unlabeled data LPM reduces the WER by 26% and 31% relative on a clean and noisy test set, respectively. This bridges the gap by 54% and 73% WER on the two test sets relative to a fully supervised model on the same 360 hours with labels. By augmenting LPM with an additional 500 hours of noisy data, we further improve the WER on the noisy test set by 15% relative. Furthermore, we perform extensive ablative studies to show the importance of various configurations of our self-supervised approach.

Keywords: speech recognition, self-supervised learning, language model, semi-supervised learning, pseudo labeling

Original Pdf: pdf

8 Replies

Loading