FID-RPRGAN-VC: Fréchet Inception Distance Loss based Region-wise Position Normalized Relativistic GAN for Non-Parallel Voice Conversion

Sandipan Dhar, MD. Tousin Akhter, Padmanabha Banerjee, Nanda Dulal Jana, Swagatam Das

Published: 2023, Last Modified: 08 Apr 2025. Venue: APSIPA ASC 2023. License: CC BY-SA 4.0

Abstract: Voice conversion (VC) is a speech-to-speech (STS) synthesis process that converts the vocal identity of a source speaker to that of a target speaker while keeping the linguistic content unaltered. In recent years, VC has been studied using generative adversarial network (GAN) models. However, in terms of naturalness, a substantial gap remains between real speech and the speech samples generated by state-of-the-art (SOTA) VC models. This work proposes an improved GAN model for non-parallel VC to enhance the naturalness of the generated speech samples. The improved model combines a region-wise positional normalization technique in the generator, a relativistic mechanism-based discriminator, and a Fréchet inception distance (FID) based loss function. We evaluated the proposed model on the VCC 2018 and CMU Arctic datasets and on a dysarthric speech dataset. The experimental results show that the proposed FID-RPRGAN-VC model outperforms the SOTA MaskCycleGAN-VC model.
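
The abstract does not state the exact form of the FID-based loss, so the following is only a minimal sketch of the standard Fréchet distance between two Gaussians fitted to feature sets, d² = ||μ_r − μ_f||² + Tr(Σ_r + Σ_f − 2(Σ_r Σ_f)^{1/2}), on which FID is built. The function name `frechet_distance` and the array shapes are assumptions made for illustration, and the choice of feature extractor (an Inception-style network or otherwise) is left open because the abstract does not specify it.

```python
import numpy as np


def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_fake: arrays of shape (n_samples, n_features),
    e.g. embeddings of real and generated speech (hypothetical inputs).
    """
    mu_r = feats_real.mean(axis=0)
    mu_f = feats_fake.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_f = np.atleast_2d(np.cov(feats_fake, rowvar=False))

    # Tr((cov_r cov_f)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_f. For positive semi-definite inputs these
    # are real and non-negative up to numerical noise, hence the clipping.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * tr_sqrt)
```

A lower value means the generated feature distribution is closer to the real one. This NumPy version is not differentiable; using the quantity as a training loss, as the abstract describes, would presumably require an equivalent computation in an automatic-differentiation framework.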