VALL-E 2

Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Abstract. This paper introduces VALL-E 2, the latest advancement in neural codec language models and a milestone in zero-shot text-to-speech (TTS) synthesis, achieving human parity for the first time. Building on its predecessor, VALL-E, this work introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history; it not only stabilizes decoding but also circumvents the infinite-loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long-sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity, and it is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis.
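As a rough illustration of the two ideas described above (not the authors' implementation), the sketch below shows (1) a repetition-aware variant of nucleus sampling that falls back to sampling from the full distribution when the drawn token already dominates the recent decoding history, and (2) how a flat codec-code sequence can be partitioned into fixed-size groups so the autoregressive model predicts one group per step. Function names, the window and threshold values, and the padding id are illustrative assumptions, not values from the paper.

```python
import torch

PAD_ID = 0  # hypothetical padding code id, used only by group_codes below


def repetition_aware_sample(logits, history, top_p=0.8, window=10, ratio=0.1):
    """Sketch of repetition-aware sampling (illustrative, not the paper's code).

    Draw a token by nucleus (top-p) sampling; if that token already dominates
    the recent decoding history, fall back to random sampling from the full
    distribution so the decoder cannot get stuck repeating itself.
    """
    probs = torch.softmax(logits, dim=-1)

    # nucleus (top-p) sampling over the sorted distribution
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = (cumulative - sorted_probs) < top_p  # always keeps the top token
    nucleus = sorted_probs * keep
    nucleus = nucleus / nucleus.sum()
    token = sorted_idx[torch.multinomial(nucleus, 1)].item()

    # repetition check over the most recent `window` tokens
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) >= ratio:
        # sampled token repeats too often: resample from the full distribution
        token = torch.multinomial(probs, 1).item()

    history.append(token)
    return token


def group_codes(codes, group_size=2):
    """Partition a flat codec-code sequence into fixed-size groups.

    Each group becomes a single modeling step for the autoregressive decoder,
    shortening the effective sequence length by a factor of `group_size`.
    """
    pad = (-len(codes)) % group_size
    codes = list(codes) + [PAD_ID] * pad
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
```

For example, with a group size of 2 a sequence of 10 codec codes is modeled in 5 autoregressive steps instead of 10, which is where the reported inference speed-up comes from.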

This page is for research demonstration purposes only.

Overview

VALL-E 2 achieves human parity zero-shot TTS performance for the first time.

Hard Examples

VALL-E 2 can synthesize personalized speech even for the hard-case texts from ELLA-V. The speaker prompts are sampled from the LibriSpeech dataset.

Text | Speaker Prompt | VALL-E | VALL-E 2
F one F two F four F eight H sixteen H thirty two H sixty four
Clever cats carefully crafted colorful collages creating cheerful compositions
Curious koalas curiously climbed curious curious climbers
Sad snakes sadly sighed sad sad sighs
Joyful jaguars joyfully jumped joyful joyful jumps
Noisy newts nonsensically nibbled noisy noisy nibbles
Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners
The future belongs to belongs to belongs to belongs to belongs to those who believe in the beauty of the beauty of the beauty of the beauty of the beauty of their dreams

LibriSpeech Samples

VALL-E 2 can perform zero-shot speech continuation, using the first 3 seconds of an utterance as the speaker prompt, and zero-shot speech synthesis, using a reference utterance from an unseen speaker as the speaker prompt. The audio and transcriptions are sampled from the LibriSpeech dataset.

Text | Speaker Prompt (Prefix/Ref) | VALL-E | VALL-E 2 (Group Size ×1) | VALL-E 2 (Group Size ×2) | VALL-E 2 (Group Size ×4)
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission
And lay me down in thy cold bed and leave my shining lot
Number ten fresh nelly is waiting on you good night husband
Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech

VCTK Samples

Zero-shot TTS from 3-second, 5-second and 10-second speaker prompts. The audio and transcriptions are sampled from the VCTK dataset.

Text | Speaker Prompt (3s/5s/10s) | VALL-E | VALL-E 2 (Group Size ×1) | VALL-E 2 (Group Size ×2) | VALL-E 2 (Group Size ×4)
We have to reduce the number of plastic bags
So what is the campaign about
My life has changed a lot
Nothing is yet confirmed

Ethics Statement

Since VALL-E 2 could synthesize speech that maintains speaker identity, it may carry potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized-speech detection model.