Elastic spectral distortion for low resource speech recognition with deep neural networks

Naoyuki Kanda; Ryu Takeda; Yasunari Obuchi

Published: 01 Jan 2013, Last Modified: 23 Aug 2024, Venue: ASRU 2013, License: CC BY-SA 4.0

Abstract: An acoustic model based on hidden Markov models with deep neural networks (DNN-HMM) has recently been proposed and has achieved high recognition accuracy. In this paper, we investigate an elastic spectral distortion method that artificially augments training samples so that DNN-HMMs acquire sufficient robustness even when only a limited number of training samples is available. We examine three distortion methods (vocal tract length distortion, speech rate distortion, and frequency-axis random distortion) and evaluate them on Japanese lecture recordings. In a large-vocabulary continuous speech recognition task with only 10 hours of training samples, a DNN-HMM trained with the elastic spectral distortion method achieved a 10.1% relative word error reduction compared with a conventionally trained DNN-HMM.
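The abstract names the three distortions but does not give their formulas, so the following is only a minimal sketch of what such augmentations could look like on a spectrogram stored as a NumPy array of shape (frames, frequency bins). All function names, the linear warping, the parameters `alpha`, `rate`, `max_shift`, and `n_control`, and the use of linear interpolation are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def warp_axis(spec, positions, axis):
    """Sample `spec` along `axis` at fractional `positions` (linear interpolation)."""
    n = spec.shape[axis]
    positions = np.clip(np.asarray(positions, dtype=float), 0, n - 1)
    lo = np.floor(positions).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = positions - lo
    shape = [1, 1]
    shape[axis] = -1  # broadcast the weights along the warped axis only
    frac = frac.reshape(shape)
    return (1 - frac) * np.take(spec, lo, axis=axis) + frac * np.take(spec, hi, axis=axis)


def vocal_tract_length_distortion(spec, alpha):
    """Assumed form: stretch or compress the frequency axis by a factor `alpha`."""
    n_bins = spec.shape[1]
    return warp_axis(spec, np.arange(n_bins) * alpha, axis=1)


def speech_rate_distortion(spec, rate):
    """Assumed form: resample the time axis; rate > 1 yields fewer frames."""
    n_frames = spec.shape[0]
    new_n = max(1, int(round(n_frames / rate)))
    return warp_axis(spec, np.linspace(0, n_frames - 1, new_n), axis=0)


def random_frequency_distortion(spec, rng, max_shift=2.0, n_control=5):
    """Assumed form: smooth random displacement of frequency bins, endpoints fixed."""
    n_bins = spec.shape[1]
    control_x = np.linspace(0, n_bins - 1, n_control)
    control_shift = rng.uniform(-max_shift, max_shift, size=n_control)
    control_shift[0] = control_shift[-1] = 0.0
    shift = np.interp(np.arange(n_bins), control_x, control_shift)
    return warp_axis(spec, np.arange(n_bins) + shift, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.random((100, 40))  # toy spectrogram: 100 frames, 40 bins
    augmented = [
        vocal_tract_length_distortion(spec, alpha=1.1),
        speech_rate_distortion(spec, rate=0.9),
        random_frequency_distortion(spec, rng),
    ]
    print([a.shape for a in augmented])
```

In a training pipeline, each distorted copy would keep the transcript of the original utterance, so a small corpus yields several acoustically varied samples per labeled utterance. Frame-level alignments would need to be rescaled whenever the time axis is resampled.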
