EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

ACL ARR 2025 May Submission6671 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Recent voice conversion research has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advances, current architectures still struggle in zero-shot cross-lingual settings and often fail to generalize to speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive, Diffusion-Transformer-based conditional flow matching speech decoder. We show that this architecture allows us to train a voice conversion model in a purely textless, self-supervised fashion, without requiring multiple encoders to disentangle speech features. Our model also excels in zero-shot cross-lingual settings, even for unseen languages. Demo samples are available at https://ez-vc.github.io/EZ-VC-Demo/
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken dialog
Languages Studied: English, Hindi, Bengali, Tamil, Telugu, Kannada, German, Spanish
Submission Number: 6671
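
The abstract names a conditional flow matching decoder but does not give its training objective. As a rough illustration only, the sketch below shows a generic conditional flow matching setup with a straight-line path between a noise sample and a target feature frame sequence (for example, mel-spectrogram frames). This is one common formulation and is not confirmed to be the one used in the paper. All function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`) and shapes are hypothetical. In a real system the velocity field would be a neural network conditioned on the discrete speech units and a reference utterance; here it is left as a placeholder callable.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Build one flow matching training example (assumed straight-line path).

    x0: noise sample, x1: target features, both shaped (batch, frames, dims).
    t:  time values in [0, 1], shaped (batch,) or a scalar.
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    # Reshape t so it broadcasts over the frame and feature axes.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    velocity_fn is a placeholder for a trained, conditioned model.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

Under this assumed formulation, training draws a random `t` per example, calls `cfm_training_pair`, and regresses the model output at `(x_t, t)` onto the target velocity with `cfm_loss`. Inference starts from noise and calls `euler_sample`, which is non-autoregressive because all frames are updated together at each step.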