Intensity Controllable Emotional Speech Synthesis Based on Valence-Arousal-Dominance

Published: 01 Jan 2024, Last Modified: 08 Apr 2025. Venue: BICS (1) 2024. License: CC BY-SA 4.0
Abstract: Speech spoofing technologies have advanced significantly, enabling the creation of fake audio that closely mimics authentic human voices. Nevertheless, such synthetic speech often lacks precise control over emotional intensity. This paper introduces a technique for modulating the intensity of emotions in synthesized speech using a three-dimensional (3D) emotion representation: Valence-Arousal-Dominance (VAD). The method maps emotion embeddings onto a continuous 3D emotion space and adjusts the value of each dimension within a specific range to regulate emotional intensity. Using a feature fusion network built on the pre-trained emotion2vec model, we train a transformation model on labeled data that converts VAD vectors into emotion embeddings. Experimental results confirm that our method improves the quality of synthesized speech and provides better control over emotional intensity.
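The idea of regulating intensity by adjusting VAD values within a bounded range can be illustrated with a minimal sketch. This is not the paper's implementation: the [-1, 1] range, the neutral point at the origin, the linear scaling rule, and the names `NEUTRAL`, `clamp`, and `scale_intensity` are all assumptions made for illustration. In the described method, the resulting VAD vector would then be passed to a learned transformation model to obtain an emotion embedding; that step is not shown here.

```python
from typing import Tuple

VAD = Tuple[float, float, float]  # (valence, arousal, dominance)

# Assumption: the neutral emotion sits at the origin of the VAD space.
NEUTRAL: VAD = (0.0, 0.0, 0.0)


def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def scale_intensity(anchor: VAD, intensity: float,
                    lo: float = -1.0, hi: float = 1.0) -> VAD:
    """Move from NEUTRAL toward (or past) an emotion's VAD anchor.

    intensity = 0.0 yields the neutral point, 1.0 yields the anchor itself,
    and larger values exaggerate the emotion. Each dimension is clamped to
    the assumed valid range [lo, hi].
    """
    v, a, d = (
        clamp(n + intensity * (x - n), lo, hi)
        for x, n in zip(anchor, NEUTRAL)
    )
    return (v, a, d)


# Hypothetical anchor for a "happy"-like emotion; the values are illustrative.
happy: VAD = (0.8, 0.6, 0.4)
weak_happy = scale_intensity(happy, 0.5)
strong_happy = scale_intensity(happy, 2.0)
```

Under these assumptions, `weak_happy` lies halfway between neutral and the anchor, while `strong_happy` saturates at the upper bound in any dimension that would otherwise exceed it.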