T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning
Zero-Shot TTS examples (Table 4).
Text input: Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
YourTTS
Text input: Rodolfo meanwhile having returned home, and having missed the crucifix, guessed who had taken it, but gave himself no concern about it.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
YourTTS
Text input: The railroads had not reached Jackson county, and wild game was plentiful on my father's farm on Big Creek near Lee's Summit.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
YourTTS
Text input: Then he reappeared, creeping along the earth, from which his dress was hardly distinguishable, directly in the rear of his intended captive.
Speaker Prompt
Ours
HierSpeech++
WhisperSpeech
XTTSv2
StyleTTS2
YourTTS
Ablation Study (Task)
Examples sampled from the evaluation data of Table 1. Number of iterations = 1.