Abstract: Emotional text-to-speech (TTS) has advanced significantly, but challenges persist due to the complexity of emotions and limitations in emotional speech datasets and models. A key issue in previous studies is their reliance on limited emotional speech datasets or extensive manual annotations, which restricts generalization across speakers and emotional styles. To address this, we propose EmoSphere++, an emotion-controllable zero-shot TTS model that generates expressive speech with fine-grained control over emotional style and intensity, without requiring manual annotations. We introduce a novel emotion-adaptive spherical vector that effectively captures emotional style and intensity, along with a joint attribute style encoder that improves generalization to both seen and unseen speakers. To further improve emotion transfer in zero-shot scenarios, we introduce an additional disentanglement method that enhances style transfer performance. Through both objective and subjective evaluations, we demonstrate the benefits of the proposed model in modeling emotional style and intensity, as well as its effectiveness in enhancing emotional expressiveness for both seen and unseen speakers.
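The core idea of representing emotional style and intensity as a spherical vector can be illustrated with a minimal sketch. The code below assumes arousal-valence-dominance (AVD) pseudo-labels from a pretrained speech emotion recognizer and centers them on an assumed neutral centroid; the function name `avd_to_spherical` and the normalization details are illustrative assumptions, not the paper's exact emotion-adaptive formulation.

```python
import numpy as np

def avd_to_spherical(avd: np.ndarray, neutral_center: np.ndarray) -> np.ndarray:
    """Map an arousal-valence-dominance (AVD) pseudo-label to a spherical
    emotion vector: radius ~ emotion intensity, angles ~ emotional style.

    Illustrative only: the shift by a neutral centroid and the exact
    normalization are assumptions, not the paper's formulation.
    """
    x, y, z = avd - neutral_center               # center the emotion space on "neutral"
    r = float(np.sqrt(x**2 + y**2 + z**2))       # radial distance -> intensity
    theta = float(np.arccos(z / r)) if r > 0 else 0.0  # polar angle
    phi = float(np.arctan2(y, x))                # azimuthal angle -> style direction
    return np.array([r, theta, phi])

# Usage: AVD pseudo-labels would come from a pretrained speech emotion recognizer.
neutral = np.array([0.5, 0.5, 0.5])              # assumed neutral centroid
sample_avd = np.array([0.8, 0.2, 0.6])           # e.g., a high-arousal, low-valence utterance
print(avd_to_spherical(sample_avd, neutral))     # [intensity, polar, azimuth]
```

In such a representation, scaling the radius would adjust emotional intensity while the angular components preserve the emotional style, which is the kind of fine-grained control the abstract describes.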
External IDs: dblp:journals/taffco/ChoOKL25