EDM-TTS: Efficient Dual-Stage Masked Modeling for Alignment-Free Text-to-Speech Synthesis
Zero-Shot TTS examples sampled randomly from the evaluation data of Table 3.
Text input: If you thought I lived in New York, why in the world didn't you come and see me? the lady inquired.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
Text input: However loudly outward circumstances might oppose this, he now felt, with a certainty which surprised him, that this work was not his own.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
Text input: The railroads had not reached Jackson county, and wild game was plentiful on my father's farm on Big Creek near Lee's Summit.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
Text input: Then he reappeared, creeping along the earth, from which his dress was hardly distinguishable, directly in the rear of his intended captive.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
Voice Conversion examples sampled randomly from the evaluation data of Table 2.
Source Utterance
Target Speaker
Ours
HierSpeech++
DiffHierVC
SoundStorm
Ablation Study (Reconstruction)
Injection Conformer, examples randomly sampled from the evaluation data of Table 5.
Speaker Prompt
Ground Truth
no-inj
inj1
inj2
inj3
noskip
Ours (Iters=8)
Iters=4
Iters=2
Iters=1
Ablation Study (Resynthesis)
Text-to-Semantic, examples randomly sampled from the evaluation data of Table 6.