Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data

Published: 15 Jun 2023, Last Modified: 27 Jun 2023, SSW12
Keywords: speech synthesis, prosody prediction, speech generation
TL;DR: This paper proposes a method for learning a multilingual expressive speech representation for prosody prediction without parallel data.
Abstract: We propose a method for speech-to-speech emotion-preserving translation that operates at the level of discrete speech units. Our approach relies on a multilingual emotion embedding that captures affective information in a language-independent manner. We show that this embedding can be used to predict the pitch and duration of speech units in a target language, allowing us to resynthesize the source speech signal with the same emotional content. We evaluate our approach on English and French speech signals and show that it outperforms a baseline method that does not use emotion information, including when the emotion embedding is extracted from a different language. Although this preliminary study does not directly address machine translation, our results demonstrate the effectiveness of our approach for cross-lingual emotion preservation in the context of speech resynthesis.
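
To make the described pipeline concrete, here is a minimal sketch of a prosody predictor that maps discrete speech units and an emotion embedding to per-unit pitch and duration, in the spirit of the abstract. All module names, dimensions, and the wiring (a GRU encoder with linear heads) are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (PyTorch). Hypothetical architecture: the unit vocabulary size,
# embedding dimensions, and GRU-based encoder are illustrative assumptions.
import torch
import torch.nn as nn


class ProsodyPredictor(nn.Module):
    """Predicts per-unit pitch and duration from discrete speech units,
    conditioned on a language-independent emotion embedding."""

    def __init__(self, num_units=100, unit_dim=128, emotion_dim=64, hidden=256):
        super().__init__()
        self.unit_embedding = nn.Embedding(num_units, unit_dim)
        self.encoder = nn.GRU(unit_dim + emotion_dim, hidden, batch_first=True)
        self.pitch_head = nn.Linear(hidden, 1)      # one pitch value per unit
        self.duration_head = nn.Linear(hidden, 1)   # one duration value per unit

    def forward(self, units, emotion_embedding):
        # units: (batch, seq_len) integer codes from a discrete-unit extractor
        # emotion_embedding: (batch, emotion_dim), assumed language-independent
        x = self.unit_embedding(units)
        emo = emotion_embedding.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.encoder(torch.cat([x, emo], dim=-1))
        return self.pitch_head(h).squeeze(-1), self.duration_head(h).squeeze(-1)


if __name__ == "__main__":
    model = ProsodyPredictor()
    units = torch.randint(0, 100, (2, 50))   # two utterances, 50 units each
    emotion = torch.randn(2, 64)             # e.g., pooled from a source-language utterance
    pitch, duration = model(units, emotion)
    print(pitch.shape, duration.shape)       # torch.Size([2, 50]) torch.Size([2, 50])
```

In such a setup, the predicted pitch and duration would condition a unit-based vocoder to resynthesize speech in the target language while keeping the emotional content of the source utterance.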