The I2R-NWPU-NTU Text-to-Speech System at Blizzard Challenge 2016

Zhengchen Zhang, Mei Li, Yuchao Zhang, Weini Zhang, Yang Liu, Shan Yang, Yanfeng Lu, Van Tung Pham, Lei Xie, Minghui Dong

Published: 2016, Last Modified: 09 Mar 2026, Blizzard Challenge 2016, CC BY-SA 4.0

Abstract: In this paper, we introduce a trajectory tiling method guided by deep neural networks (DNNs) for text-to-speech (TTS), the I2R-NWPU-NTU team's entry to the Blizzard Challenge 2016. We build a deep bidirectional LSTM (DBLSTM) network to predict phoneme-level durations and frame-level acoustic parameters. Once the acoustic parameters are predicted, the best units are selected from the database using the trajectory tiling method. Experiments demonstrate that, under the DBLSTM framework, the context information of a phoneme extracted during text processing helps duration prediction but does not help acoustic modeling. The results of the subjective evaluation are also discussed.
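
The unit-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the cost functions or the search procedure, so everything below is an assumption made for illustration: the names `target_cost`, `join_cost`, and `select_units` are hypothetical, the target cost is a mean squared distance between the predicted frames and a candidate unit's frames, the join cost is a squared distance across the unit boundary, and the search is a plain Viterbi pass. It shows the general idea of tiling a predicted trajectory with database units, not the system's actual implementation.

```python
from typing import List, Sequence

Frame = Sequence[float]   # one frame of acoustic parameters
Unit = List[Frame]        # a database unit: a non-empty list of frames


def sq_dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def target_cost(predicted: Unit, unit: Unit) -> float:
    """Assumed target cost: mean squared frame distance to the predicted
    trajectory, compared over the shorter of the two lengths."""
    n = min(len(predicted), len(unit))
    return sum(sq_dist(predicted[i], unit[i]) for i in range(n)) / n


def join_cost(left: Unit, right: Unit) -> float:
    """Assumed join cost: mismatch between the last frame of the left unit
    and the first frame of the right unit."""
    return sq_dist(left[-1], right[0])


def select_units(predicted_segments: List[Unit],
                 candidates: List[List[Unit]],
                 join_weight: float = 1.0) -> List[int]:
    """Viterbi search: pick one candidate index per segment so that the sum
    of target costs and weighted join costs is minimal."""
    if len(predicted_segments) != len(candidates):
        raise ValueError("need one candidate list per predicted segment")
    if not candidates:
        return []

    # best[j]: cheapest cost of any path ending at candidate j of the current segment
    best = [target_cost(predicted_segments[0], u) for u in candidates[0]]
    back: List[List[int]] = []  # back-pointers for segments 1..T-1

    for t in range(1, len(candidates)):
        new_best, pointers = [], []
        for unit in candidates[t]:
            tc = target_cost(predicted_segments[t], unit)
            costs = [best[k] + join_weight * join_cost(prev, unit)
                     for k, prev in enumerate(candidates[t - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + tc)
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

In this sketch, `predicted_segments` would hold the network's frame-level predictions split by the predicted phoneme durations, and `candidates[t]` would hold the database units considered for segment `t`. Raising `join_weight` favours smoother concatenation over closeness to the predicted trajectory.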