Keywords: voice conversion, gradient reversal, adversarial learning, speech synthesis
Abstract: Voice conversion (VC) aims to transform one speaker's speech so that it sounds as if it were spoken by another speaker. Previous work focuses on learning latent representations with two separate encoders, one for the content information and one for the timbre information in the input speech. However, whether these methods rely on a bottleneck network or on vector quantization, it is very difficult to perfectly separate speaker and content information in a speech signal. In this paper, we propose a novel voice conversion framework, 'ClsVC', to address this problem. It uses a single encoder and obtains both timbre and content information by dividing the latent space. In addition, we propose constraints that ensure each part of the latent space contains only content information or only timbre information. We show that these constraints are necessary, and we show experimentally that content and timbre information remain well separated even when the division proportion of the latent space is changed. Experiments on the VCTK dataset show that ClsVC is a state-of-the-art framework in terms of the naturalness and similarity of the converted speech.
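The latent-space division described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_latent` and `swap_timbre`, the `content_ratio` parameter, and the convention that the content part comes first are all assumptions made for illustration, and the training constraints that enforce the separation are not shown.

```python
from typing import List, Tuple


def split_latent(z: List[float], content_ratio: float = 0.5) -> Tuple[List[float], List[float]]:
    """Split one latent vector into a (content, timbre) pair.

    Assumption: the first part holds content and the rest holds timbre.
    content_ratio is a hypothetical name for the division proportion.
    """
    if not 0.0 < content_ratio < 1.0:
        raise ValueError("content_ratio must be strictly between 0 and 1")
    k = int(round(len(z) * content_ratio))
    if k == 0 or k == len(z):
        raise ValueError("latent vector too short for this content_ratio")
    return z[:k], z[k:]


def swap_timbre(z_source: List[float], z_target: List[float],
                content_ratio: float = 0.5) -> List[float]:
    """Build a converted latent: content from the source, timbre from the target."""
    if len(z_source) != len(z_target):
        raise ValueError("source and target latents must have the same length")
    content, _ = split_latent(z_source, content_ratio)
    _, timbre = split_latent(z_target, content_ratio)
    return content + timbre


# Toy latents standing in for the output of a single shared encoder.
z_src = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z_tgt = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
converted = swap_timbre(z_src, z_tgt, content_ratio=0.5)
```

In a full system, a decoder would turn the converted latent back into speech; changing `content_ratio` corresponds to changing the division proportion mentioned in the abstract.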
One-sentence Summary: A novel single-encoder framework for voice conversion that separates content and timbre by dividing the latent space.
Supplementary Material: zip