VALL-E 2
Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Abstract.
This paper introduces VALL-E 2, the latest advancement in neural codec language models, which marks a milestone in zero-shot text-to-speech synthesis (TTS) by achieving human parity for the first time. Building on its predecessor VALL-E, this work introduces two significant enhancements. Repetition Aware Sampling refines the original nucleus sampling by accounting for token repetition in the decoding history; it not only stabilizes decoding but also circumvents the infinite-loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which both boosts inference speed and addresses the challenges of long-sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity, and is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis.
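The Repetition Aware Sampling idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window size, repetition-ratio threshold, and function names are hypothetical, and the fallback here simply resamples from the full distribution when the nucleus-sampled token dominates the recent decoding history.

```python
import numpy as np

def nucleus_sample(probs, top_p, rng):
    # Keep the smallest set of tokens whose cumulative probability reaches top_p,
    # renormalize, and sample from that nucleus.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

def repetition_aware_sample(probs, history, window=10, ratio=0.5,
                            top_p=0.9, rng=None):
    """Sample one codec token (illustrative sketch).

    First draw a token by nucleus sampling. If that token already makes up
    at least `ratio` of the last `window` tokens in the decoding history,
    resample from the full distribution instead, to avoid getting stuck in
    a repetition loop. All parameter names are assumptions for this sketch.
    """
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history)[-window:]
    if recent and recent.count(token) / len(recent) >= ratio:
        # The nucleus-sampled token dominates the recent window:
        # fall back to sampling from the full (untruncated) distribution.
        token = int(rng.choice(len(probs), p=probs))
    return token
```

In practice the decoder would append each sampled token to `history` and repeat until an end-of-sequence code is produced; the fallback path is what prevents a sharply peaked distribution from producing the same token indefinitely.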
This page is for research demonstration purposes only.
VALL-E 2 achieves human parity zero-shot TTS performance for the first time.
Hard Examples
VALL-E 2 can synthesize personalized speech even for the hard texts from ELLA-V. The speaker prompts are sampled from the LibriSpeech dataset.
| Text | Speaker Prompt | VALL-E | VALL-E 2 |
|---|---|---|---|
| F one F two F four F eight H sixteen H thirty two H sixty four | (audio) | (audio) | (audio) |
| Clever cats carefully crafted colorful collages creating cheerful compositions | (audio) | (audio) | (audio) |
| Curious koalas curiously climbed curious curious climbers | (audio) | (audio) | (audio) |
| Sad snakes sadly sighed sad sad sighs | (audio) | (audio) | (audio) |
| Joyful jaguars joyfully jumped joyful joyful jumps | (audio) | (audio) | (audio) |
| Noisy newts nonsensically nibbled noisy noisy nibbles | (audio) | (audio) | (audio) |
| Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners | (audio) | (audio) | (audio) |
| The future belongs to belongs to belongs to belongs to belongs to those who believe in the beauty of the beauty of the beauty of the beauty of the beauty of their dreams | (audio) | (audio) | (audio) |
LibriSpeech Samples
VALL-E 2 can perform zero-shot speech continuation, using the first 3 seconds of an utterance as the speaker prompt, and zero-shot speech synthesis, using a reference utterance from an unseen speaker as the speaker prompt. The audio and transcriptions are sampled from the LibriSpeech dataset.
| Text | Speaker Prompt (Prefix/Ref) | VALL-E | VALL-E 2 (Group Size ×1) | VALL-E 2 (Group Size ×2) | VALL-E 2 (Group Size ×4) |
|---|---|---|---|---|---|
| They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission | (audio) | (audio) | (audio) | (audio) | (audio) |
| And lay me down in thy cold bed and leave my shining lot | (audio) | (audio) | (audio) | (audio) | (audio) |
| Number ten fresh nelly is waiting on you good night husband | (audio) | (audio) | (audio) | (audio) | (audio) |
| Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech | (audio) | (audio) | (audio) | (audio) | (audio) |
VCTK Samples
Zero-shot TTS from 3-second, 5-second, and 10-second speaker prompts. The audio and transcriptions are sampled from the VCTK dataset.
| Text | Speaker Prompt (3s/5s/10s) | VALL-E | VALL-E 2 (Group Size 1) | VALL-E 2 (Group Size 2) | VALL-E 2 (Group Size 4) |
|---|---|---|---|---|---|
| We have to reduce the number of plastic bags | (audio) | (audio) | (audio) | (audio) | (audio) |
| So what is the campaign about | (audio) | (audio) | (audio) | (audio) | (audio) |
| My life has changed a lot | (audio) | (audio) | (audio) | (audio) | (audio) |
| Nothing is yet confirmed | (audio) | (audio) | (audio) | (audio) | (audio) |
Ethics Statement
Since VALL-E 2 can synthesize speech that maintains speaker identity, it may carry potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized-speech detection model.