Unsupervised Speech Recognition

Alexei Baevski; Wei-Ning Hsu; Alexis Conneau; Michael Auli

Published: 09 Nov 2021, Last Modified: 26 May 2025, NeurIPS 2021 Oral, Readers: Everyone

Keywords: Deep learning, speech processing, unsupervised learning, self-supervised learning, adversarial learning, GAN

TL;DR: Unsupervised learning of speech recognition models using self-supervised representations.

Abstract: Despite rapid progress in the recent past, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phone error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.
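The phone and word error rates quoted in the abstract are conventionally computed as the edit distance between reference and hypothesis token sequences, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `edit_distance`, `error_rate`, and `relative_reduction` and the example sentences are hypothetical and chosen for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent: edits needed to turn hyp into ref, per reference token.
    Tokens are words for WER and phones for PER."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative error reduction in percent between two error rates."""
    return 100.0 * (before - after) / before


# Illustrative sentences (assumed, not from the paper): one substitution, one deletion.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
wer = error_rate(reference, hypothesis)

# Using the TIMIT phone error rates reported in the abstract.
timit_gain = relative_reduction(26.1, 11.3)
```

Under this definition, the reported drop in TIMIT phone error rate from 26.1 to 11.3 corresponds to a relative reduction of roughly 57%.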

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: pdf

Code: https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/unsupervised

Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/unsupervised-speech-recognition/code)
