IMPACT-TTS: A Multimodal Prompt and Control Approach for Overcoming Low-Resource Constraints in Emotional TTS
Abstract: Advancing emotional expressiveness in Text-to-Speech (TTS) systems remains a pivotal challenge for natural and adaptive voice synthesis. Existing emotion-aware TTS models often struggle with limited emotional diversity, a lack of fine-grained control, and reliance on small labeled emotional speech-text datasets, which limits their scalability and adaptability. To address these limitations, we propose IMPACT-TTS, an Integrated Multimodal Prompt and Control system for Emotional TTS that leverages a disentangled emotion module and a novel emotion modulation function. By incorporating large-scale pretrained multimodal models, IMPACT-TTS mitigates dataset constraints while enabling flexible emotional adjustment through prompt-based control. Our approach allows seamless blending of emotional intensities, significantly enhancing expressiveness even with low-resource labeled data. Experimental results demonstrate that IMPACT-TTS outperforms existing models in emotional naturalness and adaptability, offering a scalable solution for emotion-aware TTS.
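The abstract does not specify the form of the emotion modulation function or how intensities are blended; the sketch below is only one plausible reading, in which disentangled emotion embeddings are mixed by a convex combination of user-supplied intensity weights to produce a single conditioning vector. All names here (`modulate_emotion`, `emotion_embs`, `intensity_weights`) and the convex-combination design are illustrative assumptions, not the authors' implementation.

```python
import torch

def modulate_emotion(emotion_embs: torch.Tensor,
                     intensity_weights: torch.Tensor) -> torch.Tensor:
    """Blend per-emotion embeddings into one conditioning vector (hypothetical).

    emotion_embs:      (num_emotions, dim) disentangled emotion embeddings
    intensity_weights: (num_emotions,) non-negative intensity controls
    """
    # Normalize intensities to a convex combination so the blended vector
    # stays on the same scale as the individual emotion embeddings.
    weights = intensity_weights / intensity_weights.sum().clamp_min(1e-8)
    return weights @ emotion_embs  # (dim,)

# Example: emphasize "happy" with a touch of "surprised" (labels assumed).
emotion_embs = torch.randn(4, 256)               # e.g. neutral, happy, sad, surprised
intensity = torch.tensor([0.0, 2.0, 0.0, 1.0])   # relative intensities per emotion
cond = modulate_emotion(emotion_embs, intensity)  # conditioning vector for the decoder
```

Under this assumed design, continuously varying the intensity weights would interpolate smoothly between emotions, which matches the abstract's claim of seamless blending of emotional intensities.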
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Text-to-Speech, Multimodal, Emotional TTS, Low Resource
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 8204