Abstract: Inspired by the recent successes of deep generative models for Text-To-Speech (TTS) such as WaveNet (van den Oord et al., 2016) and Tacotron (Wang et al., 2017), this article proposes the use of a deep generative model tailored for Automatic Speech Recognition (ASR) as the primary acoustic model (AM) for an overall recognition system with a separate language model (LM). Two dimensions of depth are considered: (1) the use of mixture density networks, both autoregressive and non-autoregressive, to generate density functions capable of modeling acoustic input sequences with much more powerful conditioning than the first-generation generative models for ASR, Gaussian Mixture Models / Hidden Markov Models (GMM/HMMs), and (2) the use of standard LSTMs, in the spirit of the original tandem approach, to produce discriminative feature vectors for generative modeling. Combining mixture density networks and deep discriminative features leads to a novel dual-stack LSTM architecture directly related to the RNN Transducer (Graves, 2012), but with the explicit functional form of a density, and combining naturally with a separate language model, using Bayes rule. The generative models discussed here are compared experimentally in terms of log-likelihoods and frame accuracies.
Keywords: Automatic Speech Recognition, Deep generative models, Acoustic modeling, End-to-end speech recognition
TL;DR: This paper proposes the use of a deep generative acoustic model for automatic speech recognition, combining naturally with other deep sequence-to-sequence modules using Bayes' rule.