Scaling Zero-Shot TTS with Speaker-Agnostic Training

ACL ARR 2025 May Submission2481 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: The goal of language model (LM)-based zero-shot text-to-speech (TTS) is to synthesize speech with voices unseen during training. However, zero-shot TTS requires labeled speaker information for each utterance during training. This information is expensive to acquire, making it difficult to scale systems to large amounts of data. In this paper, we show that these issues can be overcome by simply combining a large dataset without speaker labels and a smaller dataset with speaker labels, before training a TTS model on the mixture. To prevent information mismatch between the two types of data, we introduce new data augmentation techniques to regularize model training: speaker dropout and speaker scrambling. As a result, we achieve relative gains of up to 64% better speaker similarity and 80% lower WER compared to standard training recipes. We show that our method not only generalizes well to low-resource and cross-lingual settings, but also scales to over 200K hours of training data. We will open-source all code and pre-trained models. Audio samples are available at https://cccmon7.github.io/opus_tts/.
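The two augmentation techniques named in the abstract can be illustrated with a minimal sketch. The paper does not publish implementation details here, so the function names, the string speaker IDs, the use of `None` for unlabeled utterances, and the probability parameters below are all illustrative assumptions: speaker dropout randomly removes a speaker label, and speaker scrambling randomly swaps in another label from the batch.

```python
import random
from typing import List, Optional

def speaker_dropout(spk_ids: List[Optional[str]], p: float = 0.1,
                    rng=random) -> List[Optional[str]]:
    """With probability p, replace an utterance's speaker label with None,
    so the model also trains on unlabeled-style conditioning.
    (Illustrative sketch; not the paper's actual implementation.)"""
    return [None if s is not None and rng.random() < p else s
            for s in spk_ids]

def speaker_scrambling(spk_ids: List[Optional[str]], p: float = 0.1,
                       rng=random) -> List[Optional[str]]:
    """With probability p, swap an utterance's speaker label for another
    label drawn from the same batch; unlabeled utterances are untouched.
    (Illustrative sketch; not the paper's actual implementation.)"""
    labels = [s for s in spk_ids if s is not None]
    out = list(spk_ids)
    for i, s in enumerate(out):
        if s is not None and labels and rng.random() < p:
            out[i] = rng.choice(labels)
    return out
```

In this reading, both transforms act on the batch's speaker-conditioning inputs only; the audio and text targets are unchanged, which is what would regularize the model against relying too heavily on (possibly missing or noisy) speaker labels.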
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Text to Speech, Scaling, TTS, Speech Language Model
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English, German
Submission Number: 2481