Abstract: In this paper, we study a simple yet elegant latent-variable attention model for automatic speech recognition (ASR) that enables the integration of attention-based sequence modeling into the direct hidden Markov model (HMM) concept. We use
a sequence of hidden variables that establishes a mapping
from output labels to input frames. Inspired by the direct
HMM model, we assume a decomposition of the label sequence posterior into emission and transition probabilities under a zero-order assumption, and incorporate both Transformer
and LSTM attention models into it. The method keeps the explicit alignment as part of the stochastic model and combines
the ease of end-to-end training of the attention model with an efficient and simple beam search. To study the
effect of the latent model, we qualitatively analyze the alignment behavior of the different approaches. Our experiments
on three ASR tasks show promising results in terms of word error rate (WER), along with more focused alignments than the standard attention models.
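For concreteness, the following is a minimal LaTeX sketch of the kind of factorization the abstract describes, under the stated zero-order assumption; the notation (label sequence $a_1^N$, input frames $x_1^T$, latent alignment positions $t_1^N$) is ours and not necessarily the paper's:

\[
p(a_1^N \mid x_1^T)
  = \sum_{t_1^N} p(a_1^N, t_1^N \mid x_1^T)
  \approx \sum_{t_1^N} \prod_{n=1}^{N}
      \underbrace{p(a_n \mid t_n, x_1^T)}_{\text{emission}}\,
      \underbrace{p(t_n \mid t_{n-1}, x_1^T)}_{\text{transition}}
\]

Here we read the zero-order assumption as the emission probability depending on the history only through the current alignment position $t_n$ (given the full input); keeping the alignment $t_1^N$ explicit in the stochastic model is what permits the simple beam search mentioned above.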