Abstract: Sequence-discriminative training has been shown to improve performance in large-vocabulary continuous speech recognition. Our main contribution is a novel method that reduces the computation time of any kind of sequence training while only slightly decreasing overall performance. The method allows three steps to run in parallel: the forward propagation through the network; the calculation of the loss and its gradient, which provides a frame-wise error signal; and an independent forward and backward propagation that uses this error signal. The last step can be computed frame-wise, so frame chunking can be used to further improve the runtime. The loss calculation can itself be parallelized over many sequences. In addition to several experiments that demonstrate the runtime gains, we provide a sketch of a convergence proof. We extend prior research on sequence training of (bidirectional) long short-term memory ((B)LSTM) networks and provide an overview and comparison of different training criteria. We have published all of the code as part of our RETURNN and RASR frameworks, including our training setup configurations.
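The decoupling described above can be illustrated with a minimal sketch. It is not the RETURNN or RASR implementation: the single linear softmax layer, the frame-wise cross-entropy used as a stand-in for a sequence criterion, and all function names are assumptions made for illustration. It only shows that once a frame-wise error signal has been stored, the gradient step can be accumulated over frame chunks.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W, frames):
    """Step 1 (hypothetical): forward pass, one posterior vector per frame."""
    return [
        softmax([sum(w * x for w, x in zip(row, frame)) for row in W])
        for frame in frames
    ]


def error_signal(posteriors, targets):
    """Step 2 (stand-in): frame-wise error signal w.r.t. the logits.

    A real sequence criterion would be computed over the whole sequence;
    here plain cross-entropy is assumed, giving posterior minus one-hot.
    """
    return [
        [p - (1.0 if k == t else 0.0) for k, p in enumerate(post)]
        for post, t in zip(posteriors, targets)
    ]


def grad_from_errors(frames, errors, n_out, n_in):
    """Step 3: accumulate dL/dW = sum_t e_t x_t^T over the given frames.

    For this linear layer no second forward pass is needed; a deeper
    network would redo the forward pass on each chunk at this point.
    """
    G = [[0.0] * n_in for _ in range(n_out)]
    for x, e in zip(frames, errors):
        for k in range(n_out):
            for j in range(n_in):
                G[k][j] += e[k] * x[j]
    return G


def chunked_grad(frames, errors, n_out, n_in, chunk_size):
    """Step 3 with frame chunking: sum the per-chunk gradients."""
    G = [[0.0] * n_in for _ in range(n_out)]
    for start in range(0, len(frames), chunk_size):
        part = grad_from_errors(
            frames[start:start + chunk_size],
            errors[start:start + chunk_size],
            n_out, n_in,
        )
        for k in range(n_out):
            for j in range(n_in):
                G[k][j] += part[k][j]
    return G


# Toy data: 5 frames with 2 features each, 3 output classes.
W = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.2]]
frames = [[1.0, 0.5], [0.2, -1.0], [0.0, 0.3], [-0.7, 0.9], [1.5, 1.5]]
targets = [0, 2, 1, 1, 0]

errors = error_signal(forward(W, frames), targets)
full = grad_from_errors(frames, errors, 3, 2)
chunked = chunked_grad(frames, errors, 3, 2, chunk_size=2)
```

In this toy setting the chunked gradient agrees with the full-sequence gradient up to floating-point rounding, because each frame contributes independently once the error signal is fixed. Whether chunking is exact for a recurrent network depends on how context across chunk boundaries is handled, which this sketch does not model.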