EDM-TTS: Efficient Dual-Stage Masked Modeling for Alignment-Free Text-to-Speech Synthesis

Zero-Shot TTS examples sampled randomly from the evaluation data of Table 3.

Text input: If you thought I lived in New York, why in the world didn't you come and see me? the lady inquired.

Speaker Prompt Ours HierSpeech++ WhisperSpeech XTTSv2 StyleTTS2

Text input: However loudly outward circumstances might oppose this, he now felt, with a certainty which surprised him, that this work was not his own.

Speaker Prompt Ours HierSpeech++ WhisperSpeech XTTSv2 StyleTTS2

Text input: The railroads had not reached Jackson county, and wild game was plentiful on my father's farm on Big Creek near Lee's Summit.

Speaker Prompt Ours HierSpeech++ WhisperSpeech XTTSv2 StyleTTS2

Text input: Then he reappeared, creeping along the earth, from which his dress was hardly distinguishable, directly in the rear of his intended captive.

Speaker Prompt Ours HierSpeech++ WhisperSpeech XTTSv2 StyleTTS2

Voice Conversion examples sampled randomly from the evaluation data of Table 2.

Source Utterance Target Speaker Ours HierSpeech++ DiffHierVC SoundStorm

Ablation Study (Reconstruction)

Injection Conformer, examples randomly sampled from the evaluation data of Table 5.

Speaker Prompt Ground Truth no-inj inj1 inj2 inj3 noskip Ours (Iters=8) Iters=4 Iters=2 Iters=1

Ablation Study (Resynthesis)

Text-to-Semantic, examples randomly sampled from the evaluation data of Table 6.

Speaker Prompt Ground Truth GT Length 0.7x GT Length 1.3x GT Length Ours (Iters=16, Pred Length) Iters=8 Iters=4