Each reference audio prompt was constructed by concatenating two speech samples, each expressing a different emotion, so that a single utterance explicitly contains multiple emotional cues.

| Emotion | Index | Audio prompt | Generated audio | | | | |
|---|---|---|---|---|---|---|---|
| | | | Voicebox | ELaTE | EmoCtrl-TTS | F5-TTS | Ours |
| Angry → Calm | (a) | | | | | | |
| | (b) | | | | | | |
| Sad → Surprised | (a) | | | | | | |
| | (b) | | | | | | |
| Happy → Disgusted | (a) | | | | | | |
| | (b) | | | | | | |
| Calm → Fearful | (a) | | | | | | |
| | (b) | | | | | | |

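The reference-prompt construction described above (joining two clips with different emotions into one utterance) can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: the function name `concat_emotion_prompts`, the 16 kHz sample rate, and the optional silence gap are all assumptions, since the page does not say how the clips were joined or preprocessed.

```python
import numpy as np

def concat_emotion_prompts(first, second, sample_rate=16000, gap_seconds=0.0):
    """Join two mono waveforms into one reference prompt.

    Hypothetical helper: assumes both clips are 1-D arrays at the same
    sample rate. The silence gap is an assumption, not a documented step.
    """
    first = np.asarray(first, dtype=np.float32)
    second = np.asarray(second, dtype=np.float32)
    if first.ndim != 1 or second.ndim != 1:
        raise ValueError("expected mono (1-D) waveforms")
    # Optional run of zeros between the two emotional segments.
    gap = np.zeros(int(round(gap_seconds * sample_rate)), dtype=np.float32)
    return np.concatenate([first, gap, second])

# Stand-in signals; real use would load two emotional speech clips instead.
angry = np.random.default_rng(0).uniform(-1, 1, 16000).astype(np.float32)
calm = np.random.default_rng(1).uniform(-1, 1, 8000).astype(np.float32)
prompt = concat_emotion_prompts(angry, calm, gap_seconds=0.1)
print(prompt.shape)
```

With real recordings, both clips would first need to be resampled to a common rate; plain concatenation keeps each emotional segment intact, so the emotion changes at a single, known boundary in the prompt.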
Japanese-to-English speech-to-speech translation

| Emotion | Index | Source audio (Japanese) | Translated audio (English) | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | | SeamlessExpressive | Voicebox(*) | ELaTE(*) | EmoCtrl-TTS(*) | F5-TTS(**) | Ours(**) |

(*): These models share the same backbone model (Voicebox).

(**): These models share the same backbone model (F5-TTS).