A Generative Adversarial Network Based Ensemble Technique for Automatic Evaluation of Machine Synthesized Speech

Jaynil Jaiswal, Ashutosh Chaubey, Sasi Kiran Reddy Bhimavarapu, Shashank Kashyap, Puneet Kumar, Balasubramanian Raman, Partha Pratim Roy

Published: 01 Jan 2019, Last Modified: 13 Nov 2024, Venue: ACPR (2) 2019, License: CC BY-SA 4.0

Abstract: In this paper, we propose a method to automatically compute a speech evaluation metric, the Virtual Mean Opinion Score (vMOS), for speech generated by Text-to-Speech (TTS) models in order to analyse its human-ness. In contrast to the manual speech evaluation techniques currently in use, the proposed method uses an end-to-end neural network to calculate a vMOS that is qualitatively similar to the manually obtained Mean Opinion Score (MOS). A Generative Adversarial Network (GAN) and a binary classifier are trained on real natural speech with known MOS, and the vMOS is calculated by averaging the scores obtained from the two networks. The input to the GAN’s discriminator is conditioned on speech generated by off-the-shelf TTS models so as to get closer to natural speech. The proposed model is shown to be trainable with a minimal amount of data, as its objective is to generate only the evaluation score and not speech. The method has been tested on speech synthesized by state-of-the-art TTS models, and it reports vMOS values of 0.6675, 0.4945 and 0.4890 for Wavenet2, Tacotron and Deepvoice3 respectively, while the vMOS for natural speech is 0.6682, on a scale from 0 to 1. These vMOS values correspond to, and are qualitatively explained by, the corresponding manually calculated MOS.
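The averaging step described in the abstract can be illustrated with a minimal Python sketch. It assumes that each network emits a single score in the range 0 to 1 for an utterance and that the combination is an unweighted arithmetic mean; the abstract does not specify either detail beyond "averaging", so both are assumptions. The function name `virtual_mos` and its parameter names are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def virtual_mos(discriminator_score: float, classifier_score: float) -> float:
    """Combine two per-utterance scores into a single vMOS value.

    Assumption: both inputs lie in [0, 1] and are averaged with equal
    weight. The names used here are illustrative, not the authors' code.
    """
    scores = {
        "discriminator_score": discriminator_score,
        "classifier_score": classifier_score,
    }
    for name, value in scores.items():
        # Reject values outside the 0-to-1 scale used for vMOS.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Unweighted arithmetic mean of the two network outputs.
    return fmean(scores.values())


if __name__ == "__main__":
    # Hypothetical scores for one utterance, not values from the paper.
    print(virtual_mos(0.6, 0.8))
```

Because the result stays on the same 0 to 1 scale as its inputs, values produced this way can be compared directly across TTS systems, which is how the figures quoted in the abstract are presented.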