X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion
Keywords: joint training, text-to-speech, voice conversion, cross-lingual, emotional
Abstract: Large language models (LLMs) have been widely used in cross-lingual and emotional speech synthesis, but they require extensive training data and retain the drawbacks of earlier autoregressive (AR) speech models, such as slow inference and a lack of robustness and interpretability. In this paper, we propose X-E-Speech, a cross-lingual emotional speech generation model that disentangles speaker style from cross-lingual content features by jointly training non-autoregressive (NAR) voice conversion (VC) and text-to-speech (TTS) models. For TTS, we freeze the style-related model components and fine-tune the content-related structures to enable cross-lingual emotional speech synthesis without a foreign accent. For VC, we improve the emotion similarity between the generated speech and the reference speech by introducing a similarity loss between the content features used for VC and the text features used for TTS.
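As a rough illustration of the two mechanisms named in the abstract, the Python sketch below shows (a) a cross-branch similarity loss between speech-derived content features (VC) and text-derived content features (TTS), and (b) freezing style-related parameters while fine-tuning content-related ones. This is a minimal sketch under stated assumptions, not the paper's implementation: the L1 distance, the frame-aligned tensor shapes, and the module-name prefixes are all hypothetical.

```python
import torch
import torch.nn.functional as F


def content_similarity_loss(vc_content: torch.Tensor,
                            tts_content: torch.Tensor) -> torch.Tensor:
    """L1 distance between content features from the VC branch (speech)
    and the TTS branch (text), pulling both branches toward one
    speaker-independent content space.

    Assumes both tensors are (batch, frames, channels) and already
    aligned to the same frame rate (e.g. via the TTS duration model).
    """
    return F.l1_loss(vc_content, tts_content)


def freeze_style_components(model: torch.nn.Module,
                            content_prefixes=("text_encoder",)) -> None:
    """Freeze all parameters except content-related submodules, a
    stand-in for the abstract's fine-tuning scheme. The prefix names
    are hypothetical and would depend on the actual architecture."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(content_prefixes)
```

In this reading, the shared content space is what lets the TTS text encoder be fine-tuned for a new language while the frozen style components keep emotion and speaker identity intact.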
Submission Number: 561