Abstract: Emotional Voice Conversion aims to manipulate speech
according to a given emotion while preserving non-emotion
components. Existing approaches cannot adequately express fine-grained
emotional attributes. In this paper, we propose an
Attention-based Interactive diseNtangling Network (AINN) that
leverages instance-wise emotional knowledge for voice conversion.
We introduce a two-stage pipeline to effectively train our
network: Stage I utilizes inter-speech contrastive learning to
model fine-grained emotion and intra-speech disentanglement
learning to better separate emotion and content. In Stage II, we
regularize the conversion with a multi-view consistency
mechanism, which helps transfer fine-grained
emotion while maintaining speech content. Extensive experiments
show that our AINN outperforms state-of-the-art methods on both
objective and subjective metrics.