Data augmentation based non-parallel voice conversion with frame-level speaker disentangler

Bo Chen, Zhihang Xu, Kai Yu

Published: 2022, Last Modified: 18 May 2025. Speech Commun. 2022. CC BY-SA 4.0

Abstract: Highlights
• We propose a data augmentation based technique for non-parallel voice conversion.
• It produces time-aligned parallel data with the same frame-level speaking style.
• We use a frame-level adversarial loss to reduce speaker identity information.
• We propose two separate speaker embeddings, one before and one after the attention mechanism.
• We use stacked 2D CNNs with conditional 1D CNNs to extract local speaking style.
• With the augmented data, a simple network suffices to build the voice conversion model.