The DKU Speech Synthesis System for 2019 Blizzard Challenge

Published: 01 Jan 2019 · Last Modified: 05 Jun 2025 · Blizzard Challenge 2019 · License: CC BY-SA 4.0
Abstract: This paper describes the DKU text-to-speech synthesis system built for the 2019 Blizzard Challenge. The task of this year’s challenge is to build a synthetic voice that is as similar, expressive, and clear as the given data, which was collected from an internet talk show. The DKU speech synthesis system adopts the end-to-end speech synthesis architecture Tacotron2. First, we analyze the data provided by the organizers and preprocess it to make it suitable for training a text-to-speech synthesis model. The preprocessing phase includes audio-text alignment, segmentation, and manual labeling of the pinyin sequences. We pre-train a synthesis model on a clean Mandarin Chinese speech synthesis dataset and fine-tune it on the preprocessed challenge data. In the synthesis phase, we preprocess the texts in the evaluation set to obtain the appropriate phoneme sequences for synthesis. After feeding the phoneme sequences into the synthesis system, we use the Griffin-Lim algorithm to estimate the phase and convert the output spectrogram to audio. We report our results based on the system performance evaluation provided by the organizers.
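As a minimal illustration of the spectrogram-to-waveform step mentioned in the abstract, the sketch below uses librosa's Griffin-Lim implementation to estimate phase and invert a magnitude spectrogram. The FFT size, hop length, iteration count, and sample rate here are illustrative assumptions, not the values used by the DKU system.

```python
# Sketch: convert a predicted magnitude spectrogram to a waveform with
# Griffin-Lim phase estimation. All parameters below are assumptions for
# illustration, not the exact configuration of the DKU system.
import numpy as np
import librosa
import soundfile as sf

def spectrogram_to_audio(mag_spec: np.ndarray,
                         n_fft: int = 1024,
                         hop_length: int = 256,
                         n_iter: int = 60) -> np.ndarray:
    """Estimate phase with Griffin-Lim and invert a linear magnitude spectrogram."""
    return librosa.griffinlim(mag_spec,
                              n_iter=n_iter,
                              hop_length=hop_length,
                              win_length=n_fft)

if __name__ == "__main__":
    # Dummy spectrogram standing in for the synthesis model's output
    # (shape: [1 + n_fft // 2, frames]).
    fake_spec = np.abs(np.random.randn(513, 200)).astype(np.float32)
    audio = spectrogram_to_audio(fake_spec)
    sf.write("synth_sample.wav", audio, 22050)
```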