Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

Frank Seide, Gang Li, Xie Chen, Dong Yu

2011 (modified: 14 Dec 2021), ASRU 2011, Readers: Everyone

Abstract: We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. We recently showed that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third, from 27.4% (obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features) to 18.5%, using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
