Zero-shot Learning for Speech Recognition with Universal Phonetic Model

Xinjian Li; Siddharth Dalmia; David R. Mortensen; Florian Metze; Alan W Black


27 Sept 2018 (modified: 05 May 2023), ICLR 2019 Conference Blind Submission, Readers: Everyone

Abstract: There are more than 7,000 languages in the world, but due to the lack of training sets, only a small number of them have speech recognition systems. Multilingual speech recognition provides a solution if at least some audio training data is available. Often, however, phoneme inventories differ between the training languages and the target language, making this approach infeasible. In this work, we address the problem of building an acoustic model for languages with zero audio resources. Our model is able to recognize unseen phonemes in the target language, given only a small text corpus. We adopt the idea of zero-shot learning and decompose phonemes into their corresponding phonetic attributes, such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over phonetic attributes, and then compute phoneme distributions with a customized acoustic model. We extensively evaluate our English-trained model on 20 unseen languages and find that, on average, it achieves a 9.9% better phone error rate than a traditional CTC-based acoustic model trained on English.
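The two-step idea in the abstract (predict phonetic attributes first, then derive phoneme distributions) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the attribute list, the phoneme signatures, and the scoring rule (add an attribute's score if the phoneme has it, subtract it otherwise, then apply a softmax) are hypothetical and are not taken from the paper, which does not specify these details in the abstract.

```python
import math

# Hypothetical attribute inventory (illustrative only, not the paper's).
ATTRIBUTES = ["vowel", "consonant", "voiced", "bilabial", "high"]

# Hypothetical phoneme signatures: the set of attributes each phoneme has.
SIGNATURES = {
    "a": {"vowel", "voiced"},
    "i": {"vowel", "voiced", "high"},
    "b": {"consonant", "voiced", "bilabial"},
    "p": {"consonant", "bilabial"},
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_distribution(attribute_scores, signatures=SIGNATURES):
    """Map per-attribute scores for one frame to a phoneme distribution.

    Assumed scoring rule: a phoneme gains an attribute's score when it has
    that attribute and loses it when it does not.
    """
    phonemes = sorted(signatures)
    logits = []
    for ph in phonemes:
        score = 0.0
        for attr in ATTRIBUTES:
            s = attribute_scores[attr]
            score += s if attr in signatures[ph] else -s
        logits.append(score)
    return dict(zip(phonemes, softmax(logits)))


# Made-up attribute scores for a single frame that looks like a voiced
# bilabial consonant.
frame_scores = {
    "vowel": -2.0,
    "consonant": 2.0,
    "voiced": 1.5,
    "bilabial": 1.0,
    "high": -1.0,
}
dist = phoneme_distribution(frame_scores)
best = max(dist, key=dist.get)
```

Under this assumed setup, supporting a phoneme that never appeared in training only requires adding its attribute signature to the table; the attribute predictor itself is left unchanged.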

Keywords: zero-shot learning, speech recognition, acoustic modeling

TL;DR: We apply zero-shot learning for speech recognition to recognize unseen phonemes
