ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

Zhongxu Wang; Yujia Wang; Mingzhu Li; Hua Huang

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Oral, License: CC BY 4.0

Abstract: We devise ArtSpeech, an articulatory representation-based text-to-speech (TTS) model that revisits the sound production system to provide an explainable and effective network for human-like speech synthesis. Current deep TTS models learn the acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulatory movement. ArtSpeech, in contrast, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables represent, respectively, the airflow forced by the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs. We further design a multi-dimensional style mapping network to extract speaking styles from these diverse articulatory representations. Each speaking style guides the output of its corresponding articulatory variation predictor, and the predictor outputs are ultimately used to predict the final mel spectrogram. Experimental results show that, compared to other open-source zero-shot TTS systems, ArtSpeech enhances synthesis quality and greatly boosts the similarity between the generated results and the target speaker’s voice and prosody.
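
As a rough illustration of two of the three articulatory representations named in the abstract, the sketch below computes frame-level energy (as RMS amplitude) and F0 (by autocorrelation peak picking) from a waveform. This is not ArtSpeech's extraction pipeline, which the abstract does not specify; the function names, frame sizes, and the 60-400 Hz search range are assumptions chosen for illustration. Vocal tract variables are omitted because they require an acoustic-to-articulatory inversion model.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames, dropping any ragged tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def rms_energy(frame):
    """Root-mean-square amplitude of one frame (a simple energy proxy)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 in Hz as the lag with the largest autocorrelation.

    Returns 0.0 when no lag has positive autocorrelation (e.g. silence).
    The fmin/fmax search range is an assumed, illustrative choice.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic stand-in for a reference utterance: a 200 Hz tone, 0.1 s at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]

frames = frame_signal(tone, frame_len=400, hop=200)
energies = [rms_energy(f) for f in frames]
f0_track = [estimate_f0(f, sample_rate) for f in frames]
```

On real speech, a dedicated pitch tracker with voicing detection would normally replace the plain autocorrelation step, which is prone to octave errors on noisy or unvoiced frames.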

Primary Subject Area: [Generation] Generative Multimedia

Relevance To Conference: Our research contributes to the themes of this conference in three ways. First, we introduce a generative model for speech synthesis from text. The model leverages the physical significance of the articulatory system, enabling zero-shot style transfer to custom voices. Second, we propose an articulatory feature extraction model that obtains articulatory representations from the reference speech, which are subsequently used as stylistic guidance during synthesis. Third, we propose a multi-dimensional style transfer module to synthesize speech that is similar to the target speaker's voice tone and speaking style. In summary, this work not only improves the quality of the synthesized results but also introduces novel perspectives on articulatory representations in TTS techniques.

Supplementary Material: zip

Submission Number: 2547
