An Analysis of Decoding for Attention-Based End-to-End Mandarin Speech Recognition

Published: 01 Jan 2018 · Last Modified: 27 Sept 2024 · ISCSLP 2018 · CC BY-SA 4.0
Abstract: Many current state-of-the-art Mandarin Large Vocabulary Continuous Speech Recognition (LVCSR) systems are built either as a hybrid of Deep Neural Networks (DNN) and Hidden Markov Models (HMM) or as a neural network trained with the Connectionist Temporal Classification (CTC) criterion. In both of these models, decoding is conducted by a Weighted Finite State Transducer (WFST) that searches for the word sequence that best matches the speech, given the acoustic and language models. Recently, attention-based end-to-end methods have become increasingly popular for Mandarin speech recognition. This approach advocates replacing complex data processing pipelines with a single neural network trained in an end-to-end fashion. In this paper, we investigate the decoding process for attention-based Mandarin models using syllables and characters as acoustic modeling units, and discuss how to incorporate word-level information into decoding. We also conduct a detailed analysis of various factors that affect decoding performance, including beam size, label smoothing, softmax temperature, attention smoothing, and coverage.
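Two of the decoding factors named above, softmax temperature and label smoothing, are generic score-shaping techniques rather than anything specific to this paper's models. The sketch below illustrates both in their standard form (function names and the example logits are illustrative, not taken from the paper): temperature rescales logits before normalization, flattening or sharpening the output distribution that beam search scores, while label smoothing mixes the one-hot target with a uniform distribution during training.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 flattens, T < 1 sharpens."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def smooth_labels(one_hot, eps=0.1):
    """Standard label smoothing: mix one-hot target with uniform mass eps."""
    one_hot = np.asarray(one_hot, dtype=float)
    return (1.0 - eps) * one_hot + eps / one_hot.size

# Illustrative logits for a 3-symbol vocabulary (hypothetical values).
logits = [2.0, 1.0, 0.1]
p_sharp = softmax(logits, temperature=0.5)  # peakier distribution
p_flat = softmax(logits, temperature=2.0)   # flatter distribution
targets = smooth_labels([0.0, 1.0, 0.0], eps=0.1)
```

A flatter distribution (higher temperature) tends to keep more hypotheses alive in the beam, which interacts with beam size; this is one reason the paper studies these factors jointly.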