TRANSFORMER TRANSDUCER: A STREAMABLE SPEECH RECOGNITION MODEL WITH TRANSFORMER ENCODERS AND RNN-T LOSS

20 Jul 2020 · OpenReview Archive Direct Upload
Abstract: In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well-suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
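
The abstract describes two mechanisms: a joint step that combines audio-encoder and label-encoder activations with a feed-forward layer to produce logits over the label space for every (frame, label-history) pair, and a self-attention mask that limits left (and optionally future) context to make decoding streamable. The sketch below illustrates both ideas in PyTorch; the class and function names, dimensions, and the tanh nonlinearity are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    """Sketch of a transducer joint network: project audio-encoder and
    label-encoder activations to a shared space, combine them, and map
    to logits over the label vocabulary (including the blank symbol)."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.audio_proj = nn.Linear(enc_dim, joint_dim)
        self.label_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, audio_enc, label_enc):
        # audio_enc: (B, T, enc_dim); label_enc: (B, U, pred_dim)
        # Broadcast-add so every acoustic frame t is paired with every
        # label-history position u, giving a (B, T, U, joint_dim) tensor.
        joint = self.audio_proj(audio_enc).unsqueeze(2) \
              + self.label_proj(label_enc).unsqueeze(1)
        return self.out(torch.tanh(joint))  # logits: (B, T, U, vocab_size)

def streaming_attention_mask(T, left_context, right_context):
    """Boolean (T, T) mask where entry [t, s] is True if frame t may
    attend to frame s, i.e. s lies in [t - left_context, t + right_context]."""
    idx = torch.arange(T)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)  # rel[t, s] = s - t
    return (rel >= -left_context) & (rel <= right_context)
```

Under this reading, `right_context = 0` corresponds to the fully streamable variant, while a small positive value corresponds to attending to a limited number of future frames, which the abstract reports narrows the accuracy gap to the full-attention model.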