From Speech Signals to Semantics - Tagging Performance at Acoustic, Phonetic and Word Levels

ISCSLP 2018
Abstract: Spoken language understanding (SLU) decodes the semantic information embedded in speech input. SLU performance can be significantly degraded by acoustic/language model mismatch between decoder training and testing. In this paper we investigate the semantic tagging performance of a bidirectional LSTM RNN (BLSTM-RNN) with input at the acoustic, phonetic and word levels. It is tested on a crowdsourced spoken dialog corpus of non-native speakers performing a job interview task. The tagging performance improves successively from the low-level acoustic MFCC features, through the mid-level stochastic senone posteriorgram, to the high-level ASR-recognized word string, with corresponding tagging accuracies of 70.6%, 82.1% and 85.1%, respectively. With a score fusion of the three individual RNNs, the accuracy is further improved to 87.0%.
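The abstract describes three BLSTM-RNN taggers, one per input level, whose scores are combined by late fusion. Below is a minimal PyTorch sketch of that setup, not the authors' code: the input/hidden/tag dimensions, the mean-pooling over time, and the uniform fusion weights are all assumptions for illustration.

```python
# Sketch of per-level BLSTM taggers with score fusion (assumptions noted inline).
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Bidirectional LSTM that emits a semantic-tag score at every time step."""
    def __init__(self, input_dim: int, hidden_dim: int, num_tags: int):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) -> per-step tag log-probabilities
        h, _ = self.blstm(x)
        return self.proj(h).log_softmax(dim=-1)

# One tagger per input level; all dimensions here are hypothetical
# (e.g. 39-dim MFCC frames, 2000-dim senone posteriors, 300-dim word vectors).
taggers = {
    "acoustic": BLSTMTagger(input_dim=39,   hidden_dim=128, num_tags=20),
    "phonetic": BLSTMTagger(input_dim=2000, hidden_dim=128, num_tags=20),
    "word":     BLSTMTagger(input_dim=300,  hidden_dim=128, num_tags=20),
}

def utterance_scores(tagger: BLSTMTagger, x: torch.Tensor) -> torch.Tensor:
    # Pool per-step scores over time so the three levels, which have
    # different sequence lengths, yield comparable utterance-level score
    # vectors. Mean pooling is an assumption; the paper does not specify it.
    return tagger(x).mean(dim=1)  # (batch, num_tags)

def fuse(scores: list) -> torch.Tensor:
    # Score fusion across the three RNNs: a uniform average of the
    # utterance-level tag scores, followed by an argmax over tags.
    return torch.stack(scores).mean(dim=0).argmax(dim=-1)
```

A usage sketch would run each tagger on its own feature stream for the same utterance, collect the three pooled score vectors with `utterance_scores`, and pass them to `fuse`; learned (non-uniform) fusion weights would be a natural refinement.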