Speaker Characteristics Guided Speech Synthesis

Zhihan Yang, Zhiyong Wu, Jia Jia

2022 (modified: 17 Apr 2023), IJCNN 2022, Readers: Everyone

Abstract: Talking head techniques are widely researched. Most previous work focuses on the association among tone, prosody, and visual cues such as head motion, lip movement, and gestures. However, it is widely believed that timbre, which matches the voice to the speaker's identity, should also be considered, since people obtain speaker-specific information from both the auditory and visual modalities. This paper aims to generate voice characteristics consistent with a selected set of speaker characteristics. We first select six speaker characteristics related to voice quality: gender, age, race, body mass index, face shape, and personality. We then train a Conditional Variational AutoEncoder with attention (attentionCVAE) to infer speaker embeddings from speaker characteristics, and employ a multi-speaker text-to-speech system to generate utterances for nonexistent speakers that we specify. Subjective tests indicate that the proposed method successfully reconstructs real-world speaker embeddings and generates realistic embeddings from speaker characteristics. Further analysis uncovers how, and to what extent, the speaker characteristics influence speakers' voice qualities.
