Knowledge Distillation From Offline to Streaming Transducer: Towards Accurate and Fast Streaming Model by Matching Alignments

Ji-Hwan Mo, Jae-Jin Jeon, Mun-Hak Lee, Joon-Hyuk Chang

Published: 2023, Last Modified: 24 Apr 2026, ASRU 2023, CC BY-SA 4.0
Abstract: The sequence transducer is a popular end-to-end automatic speech recognition model for streaming scenarios; however, there is a trade-off between accuracy and latency. Latency regularization methods such as FastEmit can reduce latency, but the more aggressively they reduce it, the worse accuracy tends to become. Conversely, knowledge distillation (KD) is used only to improve accuracy and does not take latency into account. In this paper, we propose an effective method that combines FastEmit with KD from an offline model to reduce latency and improve accuracy in scenarios where the latency gap between the offline and streaming models becomes small. The method reduces the latency gap by applying FastEmit to both the offline and streaming models. Experimental results on the LibriSpeech dataset show that the model with the best trade-off between accuracy and latency achieves a relative error reduction rate of 7.5% and reduces latency by $130 \mathrm{~ms}$ compared with the streaming conformer transducer.