DetailTTS: Learning Residual Detail Information for Zero-shot Text-to-speech

Published: 01 Jan 2025 · Last Modified: 22 Jul 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Traditional text-to-speech (TTS) systems often face challenges in aligning text and speech, leading to the omission of critical linguistic and acoustic details. This misalignment creates an information gap, which existing methods attempt to address by incorporating additional inputs; however, such inputs often introduce data inconsistencies and increase model complexity. To address these issues, we propose DetailTTS, a zero-shot TTS system based on a conditional variational autoencoder. It incorporates two key components: the Prior Detail Module and the Duration Detail Module, which capture residual detail information missed during alignment. These modules enhance the model's ability to retain fine-grained details, significantly improving speech quality while simplifying the model by obviating the need for additional inputs. Experiments on the WenetSpeech4TTS dataset show that DetailTTS outperforms traditional TTS systems in both naturalness and speaker similarity, even in zero-shot scenarios. Our source code and demo page are available at https://detailtts.github.io/.
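To make the "residual detail" idea concrete, below is a minimal PyTorch sketch of how a detail module might refine an aligned prior by predicting an additive correction, so the alignment output is preserved and only the missing fine-grained information is layered on top. This is an illustrative assumption, not the authors' implementation; the class name `PriorDetailModule` and all dimensions are hypothetical.

```python
# Hypothetical sketch (not the paper's code): a residual "detail" block that
# refines aligned prior hidden states with information alignment may drop.
import torch
import torch.nn as nn


class PriorDetailModule(nn.Module):
    """Predicts a residual correction to the aligned prior hidden states."""

    def __init__(self, hidden_dim: int = 192):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
        )

    def forward(self, prior: torch.Tensor) -> torch.Tensor:
        # prior: (batch, hidden_dim, frames) aligned text representation.
        # Residual formulation: the prior passes through unchanged and the
        # module only adds the predicted fine-grained detail on top.
        return prior + self.refine(prior)


if __name__ == "__main__":
    x = torch.randn(2, 192, 100)   # dummy aligned prior states
    refined = PriorDetailModule()(x)
    print(refined.shape)           # torch.Size([2, 192, 100])
```

Because the block learns only a residual, it can degrade gracefully to the identity mapping when the alignment already captures everything, which matches the abstract's framing of the modules as recovering information "missed during alignment" rather than replacing the aligned prior.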