USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Published: 2024, Last Modified: 21 Jan 2026. IEEE ACM Trans. Audio Speech Lang. Process. 2024. License: CC BY-SA 4.0
Abstract: Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. Zero-shot and few-shot speaker-adaptive TTS approaches have been explored, but each has limitations. Zero-shot approaches tend to generalize poorly and struggle to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they impose a significant storage burden and risk overfitting and catastrophic forgetting. In addition, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term “instant” and “fine-grained” adaptation, respectively, based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we design two innovative discriminators and introduce a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage requirements in few-shot speaker adaptation, we design two adapters and a unique adaptation procedure. Additionally, we introduce a new TTS dataset of 44,000 English utterances from 134 non-native speakers, capturing a wide array of non-native English accents. This dataset is intended to support more holistic evaluation of adaptive TTS capabilities. Comprehensive experiments on multiple datasets comprising both native and non-native speakers show that our approach outperforms contemporary methods across various subjective and objective metrics.