Abstract: In this paper, we propose a method for representing linguistic features in Chinese text-to-speech (TTS) systems. Traditional linguistic features for Chinese TTS include information about the related phones, syllables, words, prosodic words, prosodic phrases, etc. Obtaining them normally requires separate models for word segmentation, part-of-speech tagging, prosodic word prediction, prosodic phrase prediction, and so on, each trained on a large annotated corpus, and the quality of these models affects the quality of the TTS system. However, such annotated corpora are often unavailable for low-resource languages. In this paper, we encode phone sequences and Chinese character sequences to form linguistic features without using annotated corpora. To represent a Chinese character sequence, each character is converted into its pronunciation (represented in Pinyin), which is further decomposed into an initial, a final, and a tone; each of these elements is represented by a one-hot vector. To represent a context window covering a sequence of Chinese characters, we concatenate the vectors of the characters in the window into a single long vector, and an autoencoder is then trained to reduce its dimensionality. The compressed vector is used as part of the linguistic features, together with phone-sequence features. We applied the proposed linguistic features in a deep neural network (DNN)-based TTS system and compared it with a system using traditional linguistic features. Both objective evaluation and a subjective listening test show that the proposed linguistic representation achieves almost the same performance as the traditional features.
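The following is a minimal sketch of the character-level representation described above: each character is mapped to Pinyin, decomposed into initial/final/tone one-hot vectors, the vectors in a context window are concatenated, and an autoencoder compresses the result. It assumes the pypinyin library for grapheme-to-Pinyin conversion and PyTorch for the autoencoder; the inventories, window width, and layer sizes are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from pypinyin import pinyin, Style

# Illustrative initial/final/tone inventories; a real front-end would define
# these from its own phone set.
INITIALS = ["", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"]
FINALS = ["a", "o", "e", "i", "u", "v", "ai", "ei", "ao", "ou", "an", "en",
          "ang", "eng", "ong", "er", "ia", "ie", "iao", "iu", "ian", "in",
          "iang", "ing", "iong", "ua", "uo", "uai", "ui", "uan", "un",
          "uang", "ueng", "ve", "van", "vn"]
TONES = ["1", "2", "3", "4", "5"]  # "5" stands for the neutral tone

CHAR_DIM = len(INITIALS) + len(FINALS) + len(TONES)

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def safe_index(item, inventory):
    # Fall back to index 0 for symbols outside the illustrative inventories.
    return inventory.index(item) if item in inventory else 0

def char_vector(ch):
    """One-hot encode a single Chinese character as initial + final + tone."""
    ini = pinyin(ch, style=Style.INITIALS, strict=False)[0][0]
    fin = pinyin(ch, style=Style.FINALS, strict=False)[0][0]
    tone3 = pinyin(ch, style=Style.TONE3, strict=False)[0][0]
    tone = tone3[-1] if tone3 and tone3[-1].isdigit() else "5"
    return np.concatenate([
        one_hot(safe_index(ini, INITIALS), len(INITIALS)),
        one_hot(safe_index(fin, FINALS), len(FINALS)),
        one_hot(safe_index(tone, TONES), len(TONES)),
    ])

def window_vector(text, center, width=2):
    """Concatenate character vectors over a context window around `center`,
    zero-padding positions that fall outside the sentence."""
    vecs = []
    for i in range(center - width, center + width + 1):
        if 0 <= i < len(text):
            vecs.append(char_vector(text[i]))
        else:
            vecs.append(np.zeros(CHAR_DIM, dtype=np.float32))
    return np.concatenate(vecs)

class WindowAutoencoder(nn.Module):
    """Compresses the concatenated window vector to a low-dimensional code,
    which would be appended to the phone-sequence features."""
    def __init__(self, in_dim, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

# Usage sketch: build the window vector for one character position and
# obtain its compressed code (the autoencoder would first be trained with
# a reconstruction loss over the corpus).
x = torch.from_numpy(window_vector("今天天气很好", center=2)).unsqueeze(0)
reconstruction, code = WindowAutoencoder(in_dim=x.shape[1])(x)
```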