Region Normalized Capsule Network Based Generative Adversarial Network for Non-parallel Voice Conversion
Abstract: Voice conversion (VC) alters the vocal characteristics of a source speaker to resemble those of a target speaker while preserving the linguistic content. Recently, researchers have turned to deep generative models for VC, particularly generative adversarial network (GAN) models, because they outperform statistical models. However, a noticeable gap in naturalness remains between real speech samples and those generated by state-of-the-art (SOTA) VC models. This study introduces an enhanced GAN model for non-parallel VC that uses mel-spectrograms as the speech feature. To improve the quality of the generated speech samples, the model incorporates a region normalization technique in the generator and a discriminator based on capsule networks (Caps-Net). The proposed model is evaluated on the VCC 2018 and CMU Arctic datasets. The experimental results show that the region normalization-based Caps-Net GAN (RNCapsGAN-VC) model outperforms the SOTA MaskCycleGAN-VC model in both objective and subjective evaluations, with less training time.