Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

Yun Chen, Lingxiao Yang, Qi Chen, Jianhuang Lai, Xiaohua Xie

Published: 19 Aug 2023, Last Modified: 12 Apr 2025, Venue: Interspeech 2023, License: CC BY 4.0

Abstract: Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to effectively train our network: Stage I utilizes inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion and content. In Stage II, we propose to regularize the conversion with a multi-view consistency mechanism. This technique helps us transfer fine-grained emotion and maintain speech content. Extensive experiments show that our AINN outperforms state-of-the-art methods in both objective and subjective metrics.