Speaking Guided by Listening: Unsupervised Text-to-Speech Generative Model Guided by End-to-End Speech Recognition
Keywords: Text-to-speech, Diffusion, Unsupervised learning
Abstract: We propose to utilize end-to-end automatic speech recognition (E2E ASR) as a guidance model to realize unsupervised text-to-speech (TTS). An unconditional score-based generative model (SGM) is trained on untranscribed speech data. At sampling time, the unconditional score estimated by the SGM is combined with gradients from ASR models via Bayes' rule to obtain the conditional score. We use a set of small ASR models, trained on only $80$ hours of labeled ASR data, to guide the unconditional SGM, and the generated speech achieves high quality scores in both objective and subjective evaluations. Similarly, we can use additional speaker verification models to control the speaker identity of the synthesized speech, which enables zero-shot TTS for a target speaker given only a few seconds of enrollment speech. Our best unsupervised synthesized speech achieves a word error rate of $\sim8\%$, and our best speaker-controlled TTS achieves a mean opinion score (MOS) of $3.3$ in the speaker similarity test.
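The guidance mechanism described in the abstract follows the standard score-decomposition view of classifier guidance. A minimal sketch, assuming notation not taken from the paper ($x_t$ for the noisy speech at a diffusion step, $y$ for the transcript, $s$ for the speaker identity) and assuming $y$ and $s$ are conditionally independent given $x_t$:
$$\nabla_{x_t} \log p(x_t \mid y, s) \;=\; \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional SGM score}} \;+\; \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{ASR gradient}} \;+\; \underbrace{\nabla_{x_t} \log p(s \mid x_t)}_{\text{speaker-verification gradient}}$$
This follows from applying $\nabla_{x_t} \log$ to Bayes' rule, $p(x_t \mid y, s) \propto p(x_t)\, p(y \mid x_t)\, p(s \mid x_t)$; dropping the speaker term recovers the text-only guided sampler, since the normalizing constant does not depend on $x_t$.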
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11485