Abstract: We present a simple idea that makes it possible to record a speaker in one language and synthesize their voice in other languages, including ones they may not even know. Such techniques open up a wide range of potential applications, such as cross-language communication, language learning, and automatic video dubbing. We call this general problem multi-language speaker-conditioned speech synthesis, and we present a simple but strong baseline for it.
Our model architecture is similar to encoder-decoder models such as Char2Wav and Tacotron. The main difference is that, instead of conditioning on characters or phonemes that are specific to a given language, we condition on a shared phonetic representation that is common to all languages. This cross-language phonetic representation of text makes it possible to synthesize speech in any language while preserving the vocal characteristics of the original speaker. Furthermore, we show that fine-tuning the weights of our model allows us to extend our results to speakers outside the training dataset.
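As a rough illustration of the conditioning idea, the sketch below maps text from two languages into one shared phoneme vocabulary and pairs the resulting IDs with a speaker ID. This is not the model described here: the toy lexicons, the IPA-style symbols, and the names `LEXICONS`, `build_phoneme_vocab`, `text_to_phoneme_ids`, and `make_decoder_input` are all assumptions made for illustration, and a real system would use a full grapheme-to-phoneme front end.

```python
# Hypothetical toy lexicons: word -> sequence of IPA-style phoneme symbols.
# A real front end would use a proper grapheme-to-phoneme converter.
LEXICONS = {
    "en": {"no": ["n", "oʊ"], "sea": ["s", "iː"]},
    "es": {"no": ["n", "o"], "sí": ["s", "i"]},
}


def build_phoneme_vocab(lexicons):
    """Build one symbol -> id table shared by every language."""
    symbols = sorted(
        {p for lex in lexicons.values() for phones in lex.values() for p in phones}
    )
    return {sym: idx for idx, sym in enumerate(symbols)}


PHONEME_VOCAB = build_phoneme_vocab(LEXICONS)


def text_to_phoneme_ids(text, lang, vocab=PHONEME_VOCAB, lexicons=LEXICONS):
    """Convert text in a given language to ids in the shared phoneme vocabulary."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[p] for p in lexicons[lang][word])
    return ids


def make_decoder_input(text, lang, speaker_id):
    """Bundle the language-independent phoneme ids with a speaker identity.

    An encoder-decoder synthesizer would embed both and condition on them;
    here we only show the shape of the conditioning input.
    """
    return {
        "phoneme_ids": text_to_phoneme_ids(text, lang),
        "speaker_id": speaker_id,
    }
```

Because the phoneme "n" receives the same ID whether it comes from English or Spanish text, the same speaker ID can in principle be paired with input from either language, which is the property the shared representation is meant to provide.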
Keywords: Speech synthesis, Voice cloning, TTS
TL;DR: We present a simple idea that makes it possible to record a speaker in one language and synthesize their voice in other languages, including ones they may not even know.