T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning

Zero-Shot TTS examples (Table 4).

Text input: Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.

Speaker Prompt | Ours | HierSpeech++ | WhisperSpeech | XTTSv2 | StyleTTS2 | YourTTS

Text input: Rodolfo meanwhile having returned home, and having missed the crucifix, guessed who had taken it, but gave himself no concern about it.

Speaker Prompt | Ours | HierSpeech++ | WhisperSpeech | XTTSv2 | StyleTTS2 | YourTTS

Text input: The railroads had not reached Jackson county, and wild game was plentiful on my father's farm on Big Creek near Lee's Summit.

Speaker Prompt | Ours | HierSpeech++ | WhisperSpeech | XTTSv2 | StyleTTS2 | YourTTS

Text input: Then he reappeared, creeping along the earth, from which his dress was hardly distinguishable, directly in the rear of his intended captive.

Speaker Prompt | Ours | HierSpeech++ | WhisperSpeech | XTTSv2 | StyleTTS2 | YourTTS

Ablation Study (Task)

Examples sampled from the evaluation data of Table 1. Number of iterations = 1.

Speaker Prompt | w/o CTC Correction, w/o Speech MLM | w/o CTC Correction, w/ Speech MLM | w/ CTC Correction, w/ Speech MLM

Ablation Study (Iterations)

Examples sampled from the evaluation data of Table 2.

Speaker Prompt | Iters=1 | Iters=4 | Iters=8

Ablation Study (CFG weight)

Examples sampled from the evaluation data of Table 3.

Speaker Prompt | CFG=0.0 (No CFG) | CFG=1.0 | CFG=1.5 | CFG=2.0