Abstract: Attention-based encoder-decoder models have shown promising performance on various sequence-to-sequence problems. For speech recognition, however, the very long input speech consumes substantial computation and memory when encoding and applying soft attention over the input sequence. While fixed-rate downsampling is usually employed to reduce the number of computation steps, it fails to account for the variable durations of phonemes. Motivated by this, we propose a differentiable adaptive downsampling approach that encodes the input sequence with a recurrent layer, adaptively keeping crucial frames and discarding redundant frames in real time. The proposed downsampling approach can therefore generate input hidden representations dynamically and is suitable for online end-to-end speech recognition. Experiments show that our proposed method reduces the phone error rate (PER) by 7.0% relative, with no loss of speed, compared with the fixed-rate downsampling technique. In addition, the adaptive encoding makes the model robust to variable-speed speech.
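To make the idea of adaptive frame selection concrete, the following is a minimal sketch of one plausible gating mechanism, not the paper's actual method: each frame is scored by a learned sigmoid gate, and only frames whose keep-probability exceeds a threshold are retained at inference time (during training the soft probabilities would be used so the gate stays differentiable). The gate parameters `w`, `b` and the function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_downsample(frames, w, b, threshold=0.5):
    """Score each frame with a learned gate (sigmoid of a linear
    projection) and keep only frames whose keep-probability exceeds
    the threshold. This shows the hard, inference-time selection;
    all names here are hypothetical, not from the paper."""
    scores = 1.0 / (1.0 + np.exp(-(frames @ w + b)))  # keep-prob per frame
    keep = scores >= threshold
    return frames[keep], scores

# Toy input: 10 acoustic frames of 4-dimensional features.
frames = rng.normal(size=(10, 4))
w = rng.normal(size=4)   # gate weights (would be learned)
b = 0.0                  # gate bias
kept, scores = adaptive_downsample(frames, w, b)
print(kept.shape[0], "of", frames.shape[0], "frames kept")
```

Because the gate depends on the frame content rather than a fixed stride, fast speech (many short phonemes) keeps more frames and slow speech drops more, which is the intuition behind robustness to variable-speed input.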