E3-VITS: Emotional End-to-End TTS with Cross-speaker Style Transfer

Wonbin Jung; Junhyeok Lee

Published: 23 Jun 2023, Last Modified: 07 Jul 2023, Venue: DeployableGenerativeAI

Keywords: text-to-speech, speech synthesis, emotional text-to-speech, end-to-end text-to-speech

TL;DR: An end-to-end emotional TTS system with a language model, conditioned on either reference speech or a textual description

Abstract: Previous emotional TTS models rely on a two-stage pipeline or on additional labels, so their training is complex and incurs a high labeling cost. To address this, we present E3-VITS, an end-to-end emotional TTS model. E3-VITS synthesizes high-quality speech in multi-speaker conditions, supports emotional speech synthesis from either reference speech or a textual description, and enables cross-speaker emotion transfer with a disjoint dataset. To achieve this, we propose batch-permuted style perturbation, which generates audio samples with unpaired emotion to improve the quality of cross-speaker emotion transfer. Results show that E3-VITS outperforms the baseline model in naturalness, speaker similarity, emotion similarity, and inference speed.
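
The abstract gives no implementation details for batch-permuted style perturbation, so the following is only a minimal sketch of one plausible reading of the name: style embeddings are shuffled along the batch axis so that each utterance is paired with the emotion style of a different sample. The function name `batch_permute_styles`, the embedding shape, and the use of NumPy are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def batch_permute_styles(style_emb, rng):
    """Shuffle style embeddings along the batch axis (hypothetical helper).

    style_emb: array of shape (batch, dim), one style vector per utterance.
    Returns the shuffled embeddings and the permutation that was applied,
    so row i of the result is the style of original sample perm[i].
    """
    perm = rng.permutation(style_emb.shape[0])
    return style_emb[perm], perm


# Toy batch: 4 utterances with 8-dimensional style vectors (assumed sizes).
rng = np.random.default_rng(0)
style_emb = rng.normal(size=(4, 8))
permuted, perm = batch_permute_styles(style_emb, rng)
```

In a training loop, the permuted styles would presumably be combined with the unpermuted text and speaker inputs to produce samples whose emotion is unpaired with the speaker; how E3-VITS uses such samples is not stated in the abstract.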

Submission Number: 23
