Generative Pre-training for Speech with Flow Matching

Abstract:

Generative models have gained increasing attention in recent years for their remarkable success in tasks that require estimating and sampling a data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-train a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundation model for speech generation tasks can be built with generative pre-training.
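As a rough illustration of the pre-training recipe above (not the exact implementation from the paper), the sketch below builds one Flow Matching training example on a mel spectrogram with a masked condition. The linear probability path, the frame-level masking scheme, and all shapes and names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(x1, mask_ratio=0.7):
    """One flow-matching training example (sketch).

    x1: target mel spectrogram, shape (frames, mel_bins).
    Returns the interpolant x_t, the regression target u_t,
    the masked condition, and a (dummy-model) loss value.
    """
    x0 = rng.standard_normal(x1.shape)      # Gaussian noise sample
    t = rng.uniform()                       # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1             # linear interpolation path
    u_t = x1 - x0                           # target velocity along the path
    # Masked conditioning: keep a random subset of frames from the clean
    # target as the condition, zero out the rest (the model learns to in-fill).
    keep = rng.random(x1.shape[0]) > mask_ratio
    cond = x1 * keep[:, None]
    # A real model would predict v = model(x_t, t, cond); zeros stand in here.
    v_pred = np.zeros_like(x_t)
    loss = np.mean((v_pred - u_t) ** 2)
    return x_t, u_t, cond, loss
```

The appeal of this setup is that fine-tuning can keep the same objective while swapping the masked clean speech for task-specific conditions (e.g. noisy speech for enhancement).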

We recommend loading the demo pages with Chrome since Safari sometimes freezes during loading.

Speech Enhancement

We rank samples in the WSJ0-CHiME3 test set from easy to hard using the PESQ of the noisy speech, and show the samples at the 0/20/40/60/80/100th percentile ranks.
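The selection procedure can be sketched as below; the helper name, the tie-breaking behavior, and the toy IDs and scores in the test are assumptions, not the demo's actual code.

```python
import numpy as np

def pick_percentile_samples(sample_ids, noisy_pesq,
                            percentiles=(0, 20, 40, 60, 80, 100)):
    """Order samples from easy (high PESQ of the noisy input) to hard
    (low PESQ), then pick the sample at each percentile rank."""
    order = np.argsort(noisy_pesq)[::-1]        # easy -> hard
    idx = [order[round(p / 100 * (len(order) - 1))] for p in percentiles]
    return [sample_ids[i] for i in idx]
```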

Percentile rank (easy to hard): 0% / 20% / 40% / 60% / 80% / 100%
Sample ID: 443c020x / 440c0203 / 443c020t / 443c020m / 444o0308 / 441c0208
Noisy speech

Models trained on Voicebank-Demand:
- MetricGAN+ (Fu et al., 2021)
- SGMSE+ (Richter et al., 2023)
- SpeechFlow (HiFi-GAN, for demo only)
- SpeechFlow (invMel + noisy phase + iSTFT, as in paper)

Models trained on DNS2020:
- DEMUCS (Défossez et al., 2020)
- SpeechFlow (HiFi-GAN, for demo only)

Ground truth waveform



Speech Separation

Samples are from an internal dataset; all speakers are unseen by the models. An interesting observation is that while the background noise may sound different from the reference recording, it sounds coherent throughout our prediction. This makes sense, since the model cannot discern which noise belongs to which speaker. The better coherence also indicates that the model learns the structure of audio better than the other models.

Columns: Sample #1 / Sample #2 / Sample #3 / Sample #4 / Sample #5

Mixture
ConvTasNet (Luo & Mesgarani, 2019): speaker 1 / speaker 2
SepFormer (Subakan et al., 2021): speaker 1 / speaker 2
SpeechFlow: speaker 1 / speaker 2
Ground truth: speaker 1 / speaker 2

Zero-shot Text-to-Speech Synthesis

Reference speakers are from an internal dataset; all speakers are unseen by the models.

Columns: Text Prompt / Voicebox (60k hours labeled data) / SpeechFlow (960 hours labeled data)
Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings
They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission
And lay me down in thy cold bed and leave my shining lot
And the whole night the tree stood still and in deep thought
Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid
The army found the people in poverty and left them in comparative wealth
Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech
He was in deep converse with the clerk and entered the hall holding him by the arm
Number ten fresh nelly is waiting on you good night husband


Additional Discussion on Mel-to-waveform

To showcase why neural vocoders are not ideal choices under some common metrics for generative tasks, here is a side-by-side comparison of HiFi-GAN and the default signal-processing method (pseudo-inverse mel-to-linear transform + phase information from the noisy speech + iSTFT) on speech enhancement. For both sampled data and real data, the neural vocoder delivers better perceptual speech quality, yet all three metrics considered are significantly worse.
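A minimal sketch of that signal-processing path, assuming generic shapes and an arbitrary mel filterbank matrix rather than the exact one used in the paper:

```python
import numpy as np
from scipy.signal import stft, istft

def mel_to_waveform(mel_spec, mel_basis, noisy_stft, n_fft=512, hop=128):
    """Mel -> waveform via pseudo-inverse mel filterbank + noisy phase + iSTFT.

    mel_spec:   (n_mels, frames) enhanced mel magnitudes (model output)
    mel_basis:  (n_mels, n_fft//2 + 1) mel filterbank matrix
    noisy_stft: (n_fft//2 + 1, frames) complex STFT of the noisy input
    """
    inv_mel = np.linalg.pinv(mel_basis)               # pseudo-inverse mel-to-linear
    lin_mag = np.clip(inv_mel @ mel_spec, 0.0, None)  # non-negative magnitude estimate
    # Borrow the phase of the noisy input (unit-magnitude complex factors).
    phase = noisy_stft / np.maximum(np.abs(noisy_stft), 1e-8)
    _, wav = istft(lin_mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
    return wav
```

Because the magnitude comes from a coarse pseudo-inverse and the phase from the noisy signal, this path tends to sound rougher than a neural vocoder, yet it stays closer to the reference signal that PESQ/ESTOI/COVL compare against.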

Metrics: PESQ / ESTOI / COVL (samples: 443c020x, 440c0203, 443c020t, 443c020m, 444o0308, 441c0208)

Sampled data:
- Mel spectrogram (invMel + noisy phase + iSTFT): 2.70 / 0.90 / 3.36
- Mel spectrogram (HiFi-GAN): 2.29 / 0.81 / 2.96

Real data:
- Mel spectrogram (invMel + noisy phase + iSTFT): 3.68 / 0.96 / 4.46
- Mel spectrogram (HiFi-GAN): 2.80 / 0.73 / 3.69
- Waveform: 4.5 / 1.00 / 5.00