Deep neural network based voice conversion with a large synthesized parallel corpus

Zhengqi Wen, Kehuang Li, Jianhua Tao, Chin-Hui Lee

Published: 2016 (APSIPA 2016), Last Modified: 07 Feb 2024

Abstract: We propose a voice conversion framework based on deep neural networks (DNNs) that maps the speech features of a source speaker to those of a target speaker. Because the parallel data available for a given pair of source and target speakers is limited, speech synthesis and dynamic time warping are used to construct a large parallel corpus for DNN training. Even when the DNNs are trained on a small corpus, they achieve a lower log spectral distortion than the conventional Gaussian mixture model (GMM) approach trained on the same data. With the synthesized parallel corpus, DNN-converted speech from the large parallel corpus is preferred over DNN-converted speech from the small parallel corpus, with a speech naturalness preference score of about 54.5% vs. 32.8% and a speech similarity preference score of about 52.5% vs. 23.6%.
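
The abstract mentions dynamic time warping (DTW) as the step that pairs source and target frames, but it does not give the features, distance measure, or path constraints the authors used. The following is only a minimal sketch of textbook DTW under assumed choices: Euclidean frame distance, the standard three-way step pattern, and hypothetical helper names (`frame_distance`, `dtw_align`). It is not the authors' implementation.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(source, target):
    """Align two frame sequences with textbook DTW.

    Returns (total_cost, path), where path is a list of (i, j) pairs
    meaning source[i] is paired with target[j].
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance source only
                cost[i][j - 1],      # advance target only
            )
    # Backtrack from the end to recover the frame pairing.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional "features": the target holds the middle frame longer.
source = [[0.0], [1.0], [2.0]]
target = [[0.0], [1.0], [1.0], [2.0]]
total_cost, path = dtw_align(source, target)
print(total_cost, path)
```

Each `(i, j)` pair in the returned path would give one input/output training example for a frame-wise mapping model, which is how a time-aligned parallel corpus is typically assembled.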
