LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer

Xun Gong; Yu Wu; Jinyu Li; Shujie Liu; Rui Zhao; Xie Chen; Yanmin Qian

Published: 01 Jan 2023, Last Modified: 14 May 2025. Venue: ICASSP 2023. License: CC BY-SA 4.0

Abstract: Traditional automatic speech recognition (ASR) systems usually focus on individual utterances and ignore long-form speech, whose history carries useful information and which is the more practical setting in real scenarios. Simply attending to a longer transcription history in a vanilla neural transducer yields little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model: the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses sentence-level long-form features directly with the output of the vocabulary predictor and embeds token-level long-form features inside the vocabulary predictor, using a pre-trained contextual encoder, RoBERTa, to further boost performance. Moreover, we propose the LongFNT architecture, which extends the original speech input with long-form speech and achieves the best performance. The effectiveness of LongFNT is validated on the LibriSpeech and GigaSpeech corpora, with 19% and 12% relative word error rate (WER) reduction, respectively.
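
For readers unfamiliar with the metric, relative WER reduction is usually computed as the drop in WER divided by the baseline WER. The sketch below assumes that standard definition; the function name and the input numbers are hypothetical illustrations, not values or code from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) chosen only for illustration;
# they are NOT results reported in the paper.
reduction = relative_wer_reduction(baseline_wer=5.0, new_wer=4.0)
print(f"relative WER reduction: {reduction:.0%}")
```

Under this definition, a relative reduction describes the improvement in proportion to the baseline, so the same percentage can correspond to different absolute WER changes on different corpora.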
