The I2R-NWPU Text-to-Speech System for Blizzard Challenge 2017

Published: 01 Jan 2017 · Last Modified: 17 Apr 2025 · Blizzard Challenge 2017 · CC BY-SA 4.0
Abstract: This paper presents the I2R-NWPU team's entry to the Blizzard Challenge 2017. As in our previous entry, we adopt a deep neural network (DNN) guided unit selection and waveform concatenation approach to synthesize speech, but we make several important improvements to the previous system. Phone durations and frame-level acoustic parameters are modelled with long short-term memory (LSTM) recurrent neural networks (RNNs), and this time the hidden Markov model (HMM) is used to assist pre-selection. Phone-level rather than frame-level units are used in the selection and concatenation process. At synthesis time, the Kullback-Leibler divergence (KLD) between the predicted target spectrum HMM and the candidate spectrum HMMs is used to preselect units. The durations and acoustic parameters of the preselected units are then predicted with the LSTM-RNN models, and the final units are selected with the Viterbi algorithm based on target and concatenation costs computed against the predicted trajectory. Listening tests show an improvement over our previous system.
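To make the selection stage concrete, the following is a minimal sketch (not the authors' code) of the two ingredients the abstract describes: a KLD measure between diagonal-covariance Gaussians, as might be used to compare target and candidate spectrum models during preselection, and a Viterbi search over preselected candidates using target and concatenation costs against a predicted trajectory. The feature representation, Euclidean cost functions, weight, and function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def gaussian_kld(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians (illustrative)."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def viterbi_unit_selection(target_feats, candidates, w_concat=1.0):
    """Select one candidate unit per target position by dynamic programming.

    target_feats: list of T predicted target feature vectors (the trajectory).
    candidates:   list of T lists; candidates[t] holds the preselected
                  candidate feature vectors for position t.
    Returns the list of chosen candidate indices, one per position.
    """
    T = len(target_feats)
    # Target cost: distance between each candidate and the predicted target.
    target_cost = [
        [np.sum((c - target_feats[t]) ** 2) for c in candidates[t]]
        for t in range(T)
    ]
    best = [list(target_cost[0])]       # best cumulative cost per candidate
    back = [[-1] * len(candidates[0])]  # back-pointers for the optimal path
    for t in range(1, T):
        row_best, row_back = [], []
        for j, cand in enumerate(candidates[t]):
            # Concatenation cost: mismatch at the join with the previous unit.
            costs = [
                best[t - 1][i]
                + w_concat * np.sum((candidates[t - 1][i] - cand) ** 2)
                for i in range(len(candidates[t - 1]))
            ]
            i_best = int(np.argmin(costs))
            row_best.append(costs[i_best] + target_cost[t][j])
            row_back.append(i_best)
        best.append(row_best)
        back.append(row_back)
    # Back-trace the lowest-cost path through the candidate lattice.
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

In a real system the target cost would be computed from the LSTM-RNN predicted parameters and the concatenation cost from the join frames of adjacent phone-level units; the quadratic distances here simply stand in for those terms.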