Evidential-TTS: High Fidelity Zero-Shot Text-to-Speech Using Evidential Deep Learning

Published: 2025, Last Modified: 25 May 2026, Venue: ICASSP 2025, License: CC BY-SA 4.0
Abstract: We propose Evidential-TTS, a novel zero-shot text-to-speech (TTS) system based on evidential deep learning (EDL). The model includes a length regulator to ensure precise alignment between phonemes and acoustic tokens. This module allows the evidential token generator to convert the aligned phoneme sequence into acoustic tokens using iterative parallel decoding (IPD). However, IPD often suffers from overconfidence when using categorical probabilities as confidence scores. To address this, we introduce model uncertainty into the sampling process, quantified through EDL optimization. This uncertainty provides a more reliable sampling path for high-quality speech generation. Experimental results show that Evidential-TTS outperforms existing models in terms of speech naturalness and intelligibility. An ablation study further demonstrates the importance of uncertainty estimation in guiding the sampling trajectory of IPD.
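The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea: iterative parallel decoding in which positions are committed in order of lowest evidential uncertainty rather than highest categorical probability. It assumes the common EDL parameterization (non-negative evidence, Dirichlet parameters alpha = evidence + 1, uncertainty = K / sum(alpha)). The names `MASK`, `dirichlet_uncertainty`, `iterative_parallel_decode`, and `toy_predictor`, the softplus evidence function, and the linear unmasking schedule are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

MASK = -1  # hypothetical id marking a position that has not been decoded yet


def dirichlet_uncertainty(logits):
    """Return expected class probabilities and a per-position uncertainty.

    Assumes the usual EDL setup: evidence >= 0, alpha = evidence + 1,
    uncertainty = K / sum(alpha), which lies in (0, 1].
    """
    evidence = np.logaddexp(0.0, logits)  # softplus keeps evidence non-negative
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)
    probs = alpha / strength
    uncertainty = logits.shape[-1] / strength[..., 0]
    return probs, uncertainty


def iterative_parallel_decode(predict_logits, length, steps=4):
    """Fill `length` token slots over `steps` passes, least uncertain first."""
    tokens = np.full(length, MASK, dtype=np.int64)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs, unc = dirichlet_uncertainty(predict_logits(tokens))
        candidates = probs.argmax(axis=-1)
        # Linear schedule (an assumption): positions left masked after this pass.
        remain = length * (steps - step - 1) // steps
        n_commit = max(1, masked.size - remain)
        # Commit the masked positions whose predictions are least uncertain.
        order = masked[np.argsort(unc[masked], kind="stable")]
        commit = order[:n_commit]
        tokens[commit] = candidates[commit]
    return tokens


def toy_predictor(tokens, vocab_size=16, seed=0):
    """Stand-in for a token generator; ignores context and returns fixed logits."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=3.0, size=(tokens.shape[0], vocab_size))


decoded = iterative_parallel_decode(toy_predictor, length=8, steps=4)
print(decoded)
```

In a real system `predict_logits` would be conditioned on the aligned phoneme sequence and on the tokens already committed, so the uncertainty ranking would change from pass to pass; the toy predictor here only exercises the control flow.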