Intelligibility Improvement of Dysarthric Speech using MMSE DiscoGAN

Mirali Purohit, Maitreya Patel, Harshit Malaviya, Ankur T. Patil, Mihir Parmar, Nirmesh J. Shah, Savan Doshi, Hemant A. Patil

2020 (modified: 22 Sept 2022) | SPCOM 2020 | Readers: Everyone

Abstract: Dysarthria is a manifestation of a disorder in the articulatory parts used during speech production, and it results in uneven, slow, slurred, or monotone speech, or speech with an abnormal rhythm. People with dysarthria therefore produce less intelligible speech. Improving the intelligibility of dysarthric speech is challenging because, unlike normal speech, little dysarthric speech data is available. Dysarthric speech and normal speech are known to differ from both the speech production and the speech perception perspectives. Recently, Generative Adversarial Network (GAN)-based architectures have become popular for learning such cross-domain relationships efficiently. In this paper, we propose to use Discover GAN (DiscoGAN) with Mean Square Error (MSE) regularization (i.e., MMSE DiscoGAN) for dysarthric-to-normal speech conversion. In particular, a direct feature-based mapping technique is used to train all the models. Finally, we use Automatic Speech Recognition (ASR) to measure the Phoneme Error Rate (PER) for a particular speaker. The proposed method is compared with a baseline Deep Neural Network (DNN)-based system. Both architectures were trained and evaluated on the UA corpus. Analyzing the results, we observed that MMSE DiscoGAN outperforms the DNN by 13.16% and 9.64% for male and female speakers, respectively. Moreover, the proposed GAN-based frameworks efficiently improve the intelligibility of dysarthric speech and generate more natural-sounding speech than the DNN-based models.
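
The abstract reports intelligibility as a Phoneme Error Rate (PER) measured with an ASR system. As a minimal sketch of how PER is commonly computed, the Python function below takes the Levenshtein (edit) distance between a reference phoneme sequence and an ASR hypothesis and divides it by the reference length. The function name `phoneme_error_rate` and the phoneme symbols are illustrative assumptions; the abstract does not specify the ASR system or the scoring tool the authors used.

```python
def phoneme_error_rate(reference, hypothesis):
    """Edit distance between two phoneme sequences, divided by the reference length.

    Illustrative sketch only: the scoring tool used in the paper is not
    described in the abstract.
    """
    if not reference:
        raise ValueError("reference must contain at least one phoneme")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j phonemes of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]  # turning i reference phonemes into an empty hypothesis
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference phoneme missing
                curr[j - 1] + 1,     # insertion: extra hypothesis phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical phoneme sequences, for illustration only.
ref = ["k", "ae", "t"]
hyp = ["k", "ah", "t"]
per = phoneme_error_rate(ref, hyp)  # one substitution over three phonemes
```

A lower PER on the converted speech than on the original dysarthric speech would indicate improved intelligibility under this metric. Because insertions count as errors, PER can exceed 1.0 when the hypothesis is much longer than the reference.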
