Guided-TTS: Text-to-Speech with Untranscribed Speech

Published: 28 Jan 2022 · Last Modified: 22 Oct 2023 · ICLR 2022 Submitted · Readers: Everyone
Keywords: Text-to-Speech, Speech Synthesis, DDPM, TTS, Untranscribed speech
Abstract: Most neural text-to-speech (TTS) models require $\langle$speech, transcript$\rangle$ paired data from the desired speaker for high-quality speech synthesis, which prevents large amounts of untranscribed data from being used for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution of speech, our model can use untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given the transcript. We show that Guided-TTS achieves performance comparable to existing methods without any transcript for LJSpeech. Our results further show that a single speaker-independent phoneme classifier trained on multi-speaker large-scale data can guide unconditional DDPMs of various speakers to perform TTS.
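The guided generative process described in the abstract is classifier guidance applied to a score-based diffusion model over mel-spectrograms: the unconditional score is corrected by the gradient of a phoneme classifier's log-likelihood. Below is a minimal, non-authoritative sketch of one guided reverse-diffusion step, not the authors' code; `score_model`, `phoneme_classifier`, the noise-schedule value `beta_t`, the guidance scale `gamma`, and the step size `dt` are all hypothetical names introduced here for illustration.

```python
import torch

def guided_reverse_step(x_t, t, phonemes, score_model, phoneme_classifier,
                        beta_t, gamma=1.0, dt=1e-2):
    """One Euler step of a classifier-guided reverse SDE (hedged sketch).

    x_t:      [B, T, n_mels] noisy mel-spectrogram at diffusion time t
    phonemes: [B, T] frame-level phoneme targets for the transcript
    """
    x_t = x_t.detach().requires_grad_(True)

    # Frame-level log p(phonemes | x_t) from a time-conditioned classifier
    log_probs = phoneme_classifier(x_t, t).log_softmax(dim=-1)  # [B, T, C]
    log_lik = log_probs.gather(-1, phonemes.unsqueeze(-1)).sum()

    # Gradient of the classifier log-likelihood w.r.t. the noisy input
    grad = torch.autograd.grad(log_lik, x_t)[0]

    with torch.no_grad():
        # Conditional score = unconditional score + scaled classifier gradient
        score = score_model(x_t, t) + gamma * grad
        # VP reverse-SDE Euler update (stochastic noise term omitted)
        x_prev = x_t + beta_t * (0.5 * x_t + score) * dt
    return x_prev.detach()
```

The classifier takes the diffusion time `t` here because, in this setting, it must classify noisy mel-spectrograms at every noise level; `gamma` trades off adherence to the transcript against sample quality, as in classifier-guidance work generally.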
One-sentence Summary: Text-to-Speech with untranscribed speech data via phoneme classification
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2111.11755/code)