The I2R-NWPU-NTU Text-to-Speech System at Blizzard Challenge 2016

Published: 01 Jan 2016, Last Modified: 08 Oct 2025, Blizzard Challenge 2016, CC BY-SA 4.0
Abstract: In this paper, we introduce a trajectory tiling method guided by deep neural networks (DNNs) for text-to-speech (TTS), which is the I2R-NWPU-NTU team's entry to the Blizzard Challenge 2016. We build a deep bidirectional LSTM (DBLSTM) based network to predict phoneme-level durations and frame-level acoustic parameters. Once the acoustic parameters are predicted, the best units are selected from the database using a trajectory tiling method. Experiments demonstrate that, under the DBLSTM framework, the context information of a phoneme extracted during text processing helps duration prediction but does not help acoustic modeling. The results of the subjective evaluation are also discussed.
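The unit selection step described above can be illustrated with a minimal sketch. The abstract does not give the cost functions or the search procedure, so everything here is an assumption made for illustration: the names `target_cost`, `concat_cost`, `select_units`, and `w_concat` are hypothetical, the target cost is a mean Euclidean distance between a candidate unit's frames and the predicted trajectory, the concatenation cost is the distance across the join, and the search is a plain Viterbi pass. It is not the system's actual implementation.

```python
from math import dist


def target_cost(unit, predicted):
    """Mean frame distance between a candidate unit and the predicted trajectory.

    Assumption: both are non-empty lists of equal-dimension frame vectors;
    only the overlapping frames are compared.
    """
    n = min(len(unit), len(predicted))
    return sum(dist(unit[i], predicted[i]) for i in range(n)) / n


def concat_cost(prev_unit, unit):
    """Distance between the last frame of one unit and the first of the next."""
    return dist(prev_unit[-1], unit[0])


def select_units(predicted_segments, candidates, w_concat=1.0):
    """Viterbi search over candidate units, one segment at a time.

    predicted_segments[t] is the predicted trajectory for segment t;
    candidates[t] is the list of database units available for segment t.
    Returns (total_cost, chosen_candidate_indices).
    """
    # best[j] holds the cheapest (cost, path) ending in candidate j.
    best = [
        (target_cost(u, predicted_segments[0]), [j])
        for j, u in enumerate(candidates[0])
    ]
    for t in range(1, len(predicted_segments)):
        new_best = []
        for j, u in enumerate(candidates[t]):
            tc = target_cost(u, predicted_segments[t])
            # Cheapest way to reach candidate j from any previous candidate.
            cost, path = min(
                (
                    (c + w_concat * concat_cost(candidates[t - 1][p[-1]], u), p)
                    for c, p in best
                ),
                key=lambda item: item[0],
            )
            new_best.append((cost + tc, path + [j]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Toy usage with one-dimensional frames and two candidates per segment.
predicted = [[(0.0,), (1.0,)], [(2.0,), (3.0,)]]
candidates = [
    [[(0.0,), (1.0,)], [(5.0,), (5.0,)]],
    [[(9.0,), (9.0,)], [(2.0,), (3.0,)]],
]
total_cost, chosen = select_units(predicted, candidates)
print(total_cost, chosen)
```

In this sketch the predicted trajectory plays the role of a guide: candidates close to it are favoured, while the concatenation term discourages audible jumps at unit boundaries. The relative weight `w_concat` is a free parameter introduced here, not a value taken from the paper.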