T-VecTTS: Adding time-varying-emotion control to flow-matching-based TTS

Submission number: 12680

EMO-Change

Each reference audio prompt was constructed by concatenating two speech samples, each expressing a different emotion, so that a single utterance explicitly contains multiple emotional cues.
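The concatenation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `concat_emotion_prompts`, the toy waveforms, and the assumption that both clips are mono and share one sample rate are all hypothetical.

```python
from typing import List, Sequence, Tuple


def concat_emotion_prompts(
    first: Sequence[float],
    second: Sequence[float],
    sample_rate: int,
) -> Tuple[List[float], float]:
    """Join two mono waveforms end to end.

    Returns the combined waveform and the time (in seconds) at which
    the second clip, and therefore the second emotion, begins.
    Assumes both clips were recorded at the same sample rate.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if len(first) == 0 or len(second) == 0:
        raise ValueError("both clips must be non-empty")
    combined = list(first) + list(second)
    boundary_sec = len(first) / sample_rate
    return combined, boundary_sec


# Toy stand-ins for an "angry" clip and a "calm" clip (hypothetical data).
angry = [0.5, -0.5, 0.5, -0.5]
calm = [0.1, 0.1]
prompt, switch_at = concat_emotion_prompts(angry, calm, sample_rate=4)
```

The returned boundary time marks where the emotion changes within the prompt, which is convenient when checking whether generated audio follows the same transition.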

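Columns: Emotion | Index | Audio prompt | Generated audio (Voicebox | ELaTE | EmoCtrl-TTS | F5-TTS | Ours)

Angry → Calm | (a)
Angry → Calm | (b)
Sad → Surprised | (a)
Sad → Surprised | (b)
Happy → Disgusted | (a)
Happy → Disgusted | (b)
Calm → Fearful | (a)
Calm → Fearful | (b)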

JVNV S2ST

Japanese-to-English speech-to-speech translation

Columns: Emotion | Index | Source audio (Japanese) | Translated audio (English) (SeamlessExpressive | Voicebox(*) | ELaTE(*) | EmoCtrl-TTS(*) | F5-TTS(**) | Ours(**))

(*): These systems share the same backbone model (Voicebox).

(**): These systems share the same backbone model (F5-TTS).