Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis

Submitted to ICLR 2024 on 22 Sept 2023 (modified: 11 Feb 2024)
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Speech Synthesis, Text-To-Speech, Representation Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In this work, we propose a novel method for modeling numerous speakers that expresses a speaker's overall characteristics as well as a trained multi-speaker model, but without additional training on the target speaker's dataset. Although various works with a similar goal have been actively studied, they do not perform as well as a trained multi-speaker model due to limitations of their fundamental principles. We propose a method that encodes a speaker's overall speech characteristics into latent speech features, discretizes them, and expresses the speech characteristics from the encoded features with a combining method. Our method outperforms the best-performing multi-speaker model and surpasses a zero-shot model by significant margins in a general text-to-speech scenario. Furthermore, it shows remarkable performance in generating new artificial speakers. In addition, we demonstrate that the encoded latent features are informative enough to completely reconstruct an original speaker's speech, implying that our method can serve as a general methodology for encoding and reconstructing speakers' characteristics in various tasks.
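The pipeline the abstract outlines (encode per-speaker latent features, discretize them, then combine the discrete codes into a speaker representation) could be sketched roughly as below. This is a minimal illustrative sketch only: the encoder, the nearest-neighbor (VQ-style) discretization, the mean-based combining, and all sizes are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper)
FEAT_DIM = 64        # dimensionality of a latent speech feature
CODEBOOK_SIZE = 128  # number of discrete codes

# Stand-in for a learned codebook (would be trained end-to-end in practice)
codebook = rng.normal(size=(CODEBOOK_SIZE, FEAT_DIM))

def encode(speech_frames: np.ndarray) -> np.ndarray:
    """Stand-in encoder: average frame-level features into one latent per utterance."""
    return speech_frames.mean(axis=0)

def discretize(feature: np.ndarray) -> int:
    """Map a continuous latent feature to its nearest codebook entry (VQ-style)."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(dists))

def combine(indices: list[int]) -> np.ndarray:
    """Combine discrete codes into one speaker representation (here: mean of code vectors)."""
    return codebook[indices].mean(axis=0)

# Example: three utterances from one speaker -> one speaker vector
utterances = [rng.normal(size=(50, FEAT_DIM)) for _ in range(3)]
codes = [discretize(encode(u)) for u in utterances]
speaker_vec = combine(codes)
print(speaker_vec.shape)  # (64,)
```

Because the speaker representation is assembled from discrete codes rather than tied to a fixed speaker embedding table, the same combining step can, in principle, mix codes that never co-occurred in training, which is one plausible route to the "new artificial speakers" the abstract mentions.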
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5050