Abstract: Prosody modelling is an essential part of a text-to-speech synthesis system. In this paper, we propose and investigate a way to leverage public domain audiobook data for word-level prosody modelling. Specifically, we base our work on the LibriSpeech project, in which a large quantity of public domain audiobook data from LibriVox was processed, selected, and aligned with text. We choose a deep long short-term memory (LSTM) recurrent neural network as the modelling tool. The input word features span the phonetic, syntactic, and semantic layers. The word prosody features include log F0, energy, and after-word break. We also propose a way of incorporating the word prosody model into a speech synthesis system. Experiments show that this is an effective way to leverage a large quantity and variety of speech data for prosody modelling in speech synthesis.
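To make the three word prosody features concrete, the following is a minimal sketch of how they might be derived from frame-level F0 and energy tracks plus word-level time alignments. It is not the paper's extraction procedure: the function name `word_prosody`, the 10 ms frame shift, the convention that an F0 of 0 marks an unvoiced frame, and the choice of per-word means are all assumptions made for illustration.

```python
import math

FRAME_SHIFT = 0.01  # seconds per frame; an assumed value, not from the paper


def word_prosody(words, f0, energy, frame_shift=FRAME_SHIFT):
    """Hypothetical helper: summarise prosody per word.

    words  : list of (text, start_sec, end_sec) from a forced alignment
    f0     : per-frame F0 in Hz, with 0 marking unvoiced frames (assumed)
    energy : per-frame energy values
    """
    feats = []
    for i, (text, start, end) in enumerate(words):
        lo = int(round(start / frame_shift))
        hi = int(round(end / frame_shift))

        # Mean log F0 over voiced frames only; None if the word has none.
        voiced = [f for f in f0[lo:hi] if f > 0]
        log_f0 = sum(math.log(f) for f in voiced) / len(voiced) if voiced else None

        # Mean energy over all frames of the word.
        frames = energy[lo:hi]
        mean_energy = sum(frames) / len(frames) if frames else None

        # After-word break: gap before the next word starts (0 for the last word).
        next_start = words[i + 1][1] if i + 1 < len(words) else end
        after_break = max(0.0, next_start - end)

        feats.append({
            "word": text,
            "log_f0": log_f0,
            "energy": mean_energy,
            "break": after_break,
        })
    return feats


if __name__ == "__main__":
    words = [("hello", 0.00, 0.03), ("world", 0.05, 0.08)]
    f0 = [100.0, 0.0, 100.0, 0.0, 0.0, 200.0, 200.0, 200.0]
    energy = [1.0, 2.0, 3.0, 0.0, 0.0, 4.0, 4.0, 4.0]
    for row in word_prosody(words, f0, energy):
        print(row)
```

In a setup like this, each word's feature triple would serve as a regression target for the sequence model, with the word-level input features as its inputs. Other per-word summaries (for example, F0 contour shape rather than a mean) would be equally plausible.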