IMPACT-TTS: A Multimodal Prompt and Control Approach for Overcoming Low-Resource Constraints in Emotional TTS
Abstract: Advancing emotional expressiveness in Text-to-Speech (TTS) systems remains a pivotal challenge for natural and adaptive voice synthesis. Existing emotion-aware TTS models often struggle with limited emotional diversity, a lack of fine-grained control, and reliance on small labeled emotional speech-text datasets, which limits their scalability and adaptability. To address these limitations, we propose IMPACT-TTS, an Integrated Multimodal Prompt and Control system for Emotional TTS that leverages a disentangled emotion module and a novel emotion modulation function. By incorporating large-scale pretrained multimodal models, IMPACT-TTS mitigates dataset constraints while enabling flexible emotional adjustment through prompt-based control. Our approach allows seamless blending of emotional intensities, significantly enhancing expressiveness even with low-resource labeled data. Experimental results demonstrate that IMPACT-TTS outperforms existing models in emotional naturalness and adaptability, offering a scalable solution for emotion-aware TTS.
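The abstract does not specify the form of the emotion modulation function or how intensities are blended; the sketch below is only one plausible reading, in which disentangled emotion embeddings are mixed by a convex combination of user-supplied intensity weights to produce a single conditioning vector. All names here (`modulate_emotion`, `emotion_embs`, `intensity_weights`) and the convex-combination design are illustrative assumptions, not the authors' implementation.

```python
import torch

def modulate_emotion(emotion_embs: torch.Tensor,
                     intensity_weights: torch.Tensor) -> torch.Tensor:
    """Blend per-emotion embeddings into one conditioning vector (hypothetical).

    emotion_embs:      (num_emotions, dim) disentangled emotion embeddings
    intensity_weights: (num_emotions,) non-negative intensity controls
    """
    # Normalize intensities to a convex combination so the blended vector
    # stays on the same scale as the individual emotion embeddings.
    weights = intensity_weights / intensity_weights.sum().clamp_min(1e-8)
    return weights @ emotion_embs  # (dim,)

# Example: emphasize "happy" with a touch of "surprised" (labels assumed).
emotion_embs = torch.randn(4, 256)               # e.g. neutral, happy, sad, surprised
intensity = torch.tensor([0.0, 2.0, 0.0, 1.0])   # relative intensities per emotion
cond = modulate_emotion(emotion_embs, intensity)  # conditioning vector for the decoder
```

Under this assumed design, continuously varying the intensity weights would interpolate smoothly between emotions, which matches the abstract's claim of seamless blending of emotional intensities.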
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Text-to-Speech, Multimodal, Emotional TTS, Low Resource
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 8204