Global Normalization for Streaming Speech Recognition in a Modular Framework

Ehsan Variani; Ke Wu; Michael Riley; David Rybach; Matt Shannon; Cyril Allauzen

Published: 31 Oct 2022, Last Modified: 11 Jan 2023, NeurIPS 2022 Accept, Readers: Everyone

TL;DR: Shows the importance of global normalization for streaming ASR, closing 50% of the performance gap between streaming and non-streaming locally normalized models, and presents a general formulation of neural ASR models.

Abstract: We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models. A JAX implementation of our models has been open sourced.
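To make the contrast between local and sequence-level normalization concrete, here is a minimal JAX sketch on a toy model. It is not the GNAT implementation and it leaves out the alignment lattice and acoustic frames entirely. It assumes a hypothetical score tensor `scores[t, prev, next]` with a context of one previous label and a fixed start label 0; all function names are illustrative. Under that finite-context assumption, the global denominator can be computed exactly by a forward recursion rather than by enumerating every label sequence.

```python
import itertools

import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def path_score(scores, labels):
    """Unnormalized score of one label sequence (assumed start label: 0)."""
    prev = jnp.concatenate([jnp.zeros((1,), dtype=labels.dtype), labels[:-1]])
    return jnp.sum(scores[jnp.arange(labels.shape[0]), prev, labels])


def local_log_prob(scores, labels):
    """Locally normalized: a softmax over next labels at every step."""
    return path_score(jax.nn.log_softmax(scores, axis=-1), labels)


def log_denominator(scores):
    """Exact log-sum of path scores over all sequences (forward recursion)."""
    num_labels = scores.shape[1]
    # Only the start label 0 has non-zero weight before the first step.
    init = jnp.full((num_labels,), -jnp.inf).at[0].set(0.0)

    def step(alpha, step_scores):
        # alpha[prev] + step_scores[prev, next], summed (in log space) over prev.
        return logsumexp(alpha[:, None] + step_scores, axis=0), None

    alpha, _ = jax.lax.scan(step, init, scores)
    return logsumexp(alpha)


def global_log_prob(scores, labels):
    """Globally normalized: one normalizer for the whole sequence."""
    return path_score(scores, labels) - log_denominator(scores)


def brute_force_log_denominator(scores):
    """Reference value by explicit enumeration; only feasible for toy sizes."""
    num_steps, num_labels, _ = scores.shape
    all_paths = itertools.product(range(num_labels), repeat=num_steps)
    return logsumexp(jnp.stack([path_score(scores, jnp.array(p)) for p in all_paths]))
```

Both variants define a valid distribution over label sequences, but the local model forces the outgoing scores of every context to sum to one, which is the constraint usually associated with label bias. The global model normalizes once per sequence, so the relative magnitude of scores across contexts is preserved.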

Supplementary Material: pdf
