Keywords: Text-to-Speech, Emotion control, controllable speech synthesis, flow matching, controlnet, time-varying conditioning, zero-shot TTS
TL;DR: Seamlessly adding time-varying emotion control to a pretrained flow-matching-based TTS model
Abstract: Recent advances in text-to-speech (TTS) have enabled natural speech synthesis, yet fine-grained, time-varying emotion control remains challenging. Existing methods typically provide only utterance-level control or require full-model fine-tuning on a large in-house emotional speech corpus. To overcome these shortcomings, we propose a time-varying-emotion controllable TTS (T-VecTTS) that seamlessly adds emotion control to a pre-trained flow-matching-based TTS model. To leverage the off-the-shelf model, we freeze the original model and attach a trainable branch that processes additional conditioning signals for emotion control.
Moreover, we identify the flow-step interval responsible for determining emotion and exploit it for fine-grained control.
We further provide practical recipes for emotion control on three components: (1) an optimal layer choice via block-level analysis, (2) control scale during inference, and (3) selecting the temporal emotion window size.
The advantages of our method include the zero-shot voice cloning capability, naturalness of the synthesized speech, and no need for a large emotional speech corpus or full-model fine-tuning.
T-VecTTS achieves state-of-the-art emotion similarity scores (Emo-SIM and Aro–Val SIM).
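The core ideas in the abstract — a frozen pretrained vector field with a zero-initialized trainable control branch (ControlNet-style), an inference-time control scale, and restricting the control to a specific flow-step interval — can be sketched as follows. This is a minimal NumPy stand-in, not the paper's implementation: all weights, dimensions, and the names `frozen_velocity`, `controlled_velocity`, and `interval_scale` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pretrained flow-matching vector field.
W_frozen = rng.standard_normal((8, 8))

def frozen_velocity(x):
    return np.tanh(x @ W_frozen)

# Trainable control branch. Its output projection is zero-initialized,
# so before training the controlled model exactly reproduces the
# pretrained behaviour (ControlNet-style attachment).
W_branch = rng.standard_normal((4, 8))  # trainable
W_out = np.zeros((8, 8))                # zero-init output projection

def interval_scale(t, t_lo=0.2, t_hi=0.6, scale=1.0):
    # Apply the control only within the flow-step interval [t_lo, t_hi]
    # identified as responsible for emotion (values hypothetical).
    return scale if t_lo <= t <= t_hi else 0.0

def controlled_velocity(x, emo_cond, t, scale=1.0):
    residual = np.tanh(emo_cond @ W_branch) @ W_out
    return frozen_velocity(x) + interval_scale(t, scale=scale) * residual

x = rng.standard_normal((1, 8))     # latent state at flow step t
c = rng.standard_normal((1, 4))     # time-varying emotion conditioning
# With W_out zero-initialized, the pretrained output is unchanged:
assert np.allclose(controlled_velocity(x, c, t=0.4), frozen_velocity(x))
```

During sampling, `emo_cond` would vary over time (the temporal emotion window of the abstract), and `scale` is the inference-time control strength; outside the chosen flow-step interval the frozen model runs unmodified.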
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12680