Frame-Level Selective Decoding Using Native and Non-native Acoustic Models for Robust Speech Recognition to Native and Non-native Speech

Yoo Rhee Oh, Hoon Chung, Jeom-ja Kang, Yun Keun Lee

2012 (modified: 06 Nov 2021), IWSDS 2012

Abstract: This paper proposes a frame-level selective-decoding method that uses both native acoustic models (AMs) and non-native AMs to build an automatic speech recognition (ASR) system that is robust to non-native as well as native speech. To this end, we use two kinds of well-trained AMs: (a) AMs trained on a large amount of native speech (native AMs) and (b) AMs trained on a large amount of non-native speech (non-native AMs). First, each speech feature vector is decoded using the native AMs and the non-native AMs in parallel, and the better-matching AMs are selected by comparing the likelihoods of the two. Then, the next M frames of speech feature vectors are decoded using the selected AMs, where M is a predefined parameter. The selection and decoding procedures are repeated until the end of the utterance is reached. In ASR experiments for English spoken by Korean speakers, an ASR system employing the proposed method reduces the average word error rate (WER) by 16.6% for English spoken by Koreans and by 41.3% for native English, compared with an ASR system employing an utterance-level selective-decoding method.
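
The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the per-frame log-likelihoods of the two AM sets are already available as plain lists, that a selection frame and the M frames after it are all assigned to the selected AMs, and that ties go to the native AMs. The function name `frame_level_select` and its parameters are hypothetical, and a real decoder would run a Viterbi search over HMM states rather than compare two numbers per frame.

```python
from typing import List, Sequence


def frame_level_select(native_ll: Sequence[float],
                       nonnative_ll: Sequence[float],
                       m: int) -> List[str]:
    """Return which AM set ("native" or "non-native") is used for each frame.

    native_ll / nonnative_ll: hypothetical per-frame log-likelihoods produced
    by the native and non-native AMs for the same utterance.
    m: number of frames decoded with the selected AMs after each selection.
    """
    if len(native_ll) != len(nonnative_ll):
        raise ValueError("both likelihood sequences must cover the same frames")
    if m < 0:
        raise ValueError("m must be non-negative")

    choices: List[str] = []
    t = 0
    n = len(native_ll)
    while t < n:
        # Selection frame: compare the two likelihoods (assumption: ties
        # are resolved in favour of the native AMs).
        selected = "native" if native_ll[t] >= nonnative_ll[t] else "non-native"
        # Assumption: the selection frame and the next m frames all use the
        # selected AMs; the span is clipped at the end of the utterance.
        span = min(m + 1, n - t)
        choices.extend([selected] * span)
        t += span
    return choices
```

With `m = 0` the sketch re-selects at every frame. As `m` grows, selections become rarer; once `m + 1` is at least the number of frames, a single decision made at the first frame covers the whole utterance. That limit is not the same as the utterance-level baseline the abstract compares against, whose details are not given here.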
