Keywords: many-to-many voice conversion, federated learning, human-in-the-loop, distributed machine learning, StarGANv2-VC
Abstract: We propose a method for training a many-to-many voice conversion (VC) model that can additionally learn users' voices while protecting the privacy of their data. Conventional many-to-many VC methods train a VC model on a publicly available or proprietary multi-speaker corpus. However, they do not always achieve high-quality VC for input speech from various users. Our method is based on federated learning, a distributed machine learning framework in which a developer and users cooperatively train a model while protecting the privacy of user-owned data. We present a proof-of-concept method based on StarGANv2-VC, named Fed-StarGANv2-VC, and demonstrate that it can achieve speaker similarity comparable to that of conventional non-federated StarGANv2-VC.
Supplementary Material: zip