Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

Published: 01 Jan 2024 · Last Modified: 12 May 2025 · Odyssey 2024 · CC BY-SA 4.0
Abstract: Emotional voice conversion (EVC) traditionally transforms a spoken utterance from one emotional state to another, with previous research focusing mainly on discrete emotion categories. We propose Mixed-EVC, a novel EVC framework that leverages only discrete emotion training labels. We construct an attribute vector that encodes the relationships among these discrete emotions; it is predicted with a ranking-based support vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC framework. During training, Mixed-EVC learns not only to characterize the input emotional style but also to quantify its relevance to the other emotions. As a result, users can assign these attribute values to achieve their desired rendering of mixed emotions. Objective and subjective evaluations confirm the effectiveness of our approach for mixed-emotion synthesis and control, and it surpasses traditional baselines in conversion between discrete emotions.
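To illustrate the idea of an emotion attribute vector learned with a ranking-based SVM, the sketch below trains a pairwise ranking SVM per emotion and scores an utterance against every emotion to obtain a mixture profile. This is a minimal illustration under our own assumptions: the feature extraction, the helper names (`fit_rank_svm`, `attribute_vector`), and the synthetic data are hypothetical and not the paper's actual implementation; the seq2seq conversion model that would consume the vector is omitted.

```python
# Hypothetical sketch: pairwise ranking SVM per emotion -> attribute vector.
# All names and data here are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_svm(features, relevance):
    """Learn a linear scoring function w.x from pairwise preferences.

    features  : (N, D) utterance-level acoustic features
    relevance : (N,)   ordinal relevance of each utterance to one emotion
    """
    diffs, labels = [], []
    for i in range(len(features)):
        for j in range(len(features)):
            if relevance[i] > relevance[j]:
                diffs.append(features[i] - features[j]); labels.append(1)
                diffs.append(features[j] - features[i]); labels.append(-1)
    svm = LinearSVC(fit_intercept=False).fit(np.asarray(diffs), labels)
    return svm.coef_.ravel()  # ranking weight vector for this emotion

def attribute_vector(x, weight_per_emotion):
    """Score one utterance against every emotion, normalised to [0, 1]."""
    scores = np.array([w @ x for w in weight_per_emotion])
    scores -= scores.min()
    return scores / (scores.max() + 1e-8)

# Toy usage: 4 emotions, random "features" standing in for real acoustics.
rng = np.random.default_rng(0)
feats = rng.normal(size=(40, 16))
weights = []
for e in range(4):
    # Synthetic relevance; in practice it would come from emotion labels.
    rel = (rng.random(40) * 3).astype(int)
    weights.append(fit_rank_svm(feats, rel))

attr = attribute_vector(feats[0], weights)   # learned mixture profile
attr_user = np.array([0.7, 0.3, 0.0, 0.0])   # or user-specified at inference
print("attribute vector:", np.round(attr, 2))
```

In such a setup, the resulting vector (whether predicted or specified by the user) would serve as the conditioning input that lets the conversion model render a desired mixture of emotions.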