Abstract:
We rank samples in the WSJ0-CHiME3 test set from easy to hard by the PESQ of the noisy speech, and show the samples at the 0/20/40/60/80/100th percentile ranks.
Percentile rank (easy to hard) | 0% | 20% | 40% | 60% | 80% | 100%
---|---|---|---|---|---|---
Sample ID | 443c020x | 440c0203 | 443c020t | 443c020m | 444o0308 | 441c0208
Noisy speech | | | | | |
Models trained on Voicebank-Demand | | | | | |
MetricGAN+ (Fu et al., 2021) | | | | | |
SGMSE+ (Richter et al., 2023) | | | | | |
SpeechFlow (HiFi-GAN, for demo only) | | | | | |
SpeechFlow (invMel+noisy phase+iSTFT, as in paper) | | | | | |
Models trained on DNS2020 | | | | | |
DEMUCS (Défossez et al., 2020) | | | | | |
SpeechFlow (HiFi-GAN, for demo only) | | | | | |
Ground truth waveform | | | | | |
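The percentile-based selection above can be sketched as follows. This is a minimal illustration: the sample IDs and scores below are made-up placeholders, and in practice each PESQ value would come from a PESQ implementation (e.g. the `pesq` package) comparing noisy speech against the clean reference.

```python
import numpy as np

def pick_percentile_samples(sample_ids, pesq_scores,
                            percentiles=(0, 20, 40, 60, 80, 100)):
    """Return sample IDs at the given percentile ranks, easiest first.

    Higher PESQ of the noisy input = less degraded = easier sample.
    """
    order = np.argsort(pesq_scores)[::-1]            # descending PESQ: easy to hard
    ranked = [sample_ids[i] for i in order]
    # Map each percentile to an index into the ranked list
    idx = [round(p / 100 * (len(ranked) - 1)) for p in percentiles]
    return [ranked[i] for i in idx]

# Illustrative scores only (not the real test-set values)
ids = ["a", "b", "c", "d", "e", "f"]
scores = [2.9, 1.4, 2.1, 1.8, 1.1, 2.5]
print(pick_percentile_samples(ids, scores))  # ['a', 'f', 'c', 'd', 'b', 'e']
```

With the full test set, the same indexing yields one representative sample per 20-point percentile bucket, which is how the columns of the table were chosen.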
Samples are from an internal dataset; all speakers are unseen by the models. An interesting observation: while the background noise may sound different from the reference recording, it does sound coherent throughout our prediction. This makes sense, since the model cannot discern which noise belongs to which speaker. The better coherence also suggests our model learns the structure of audio better than the other models.
 | | Sample #1 | Sample #2 | Sample #3 | Sample #4 | Sample #5
---|---|---|---|---|---|---
Mixture | | | | | |
ConvTasNet (Luo & Mesgarani, 2019) | speaker 1 | | | | |
 | speaker 2 | | | | |
SepFormer (Subakan et al., 2021) | speaker 1 | | | | |
 | speaker 2 | | | | |
SpeechFlow | speaker 1 | | | | |
 | speaker 2 | | | | |
Ground truth | speaker 1 | | | | |
 | speaker 2 | | | | |
Reference speakers are from an internal dataset; all speakers are unseen by the models.
Text | Prompt | Voicebox | SpeechFlow |
---|---|---|---|
 | | 60k hours labeled data | 960 hours labeled data |
Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings | |||
They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission | |||
And lay me down in thy cold bed and leave my shining lot | |||
And the whole night the tree stood still and in deep thought | |||
Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid | |||
The army found the people in poverty and left them in comparative wealth | |||
Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech | |||
He was in deep converse with the clerk and entered the hall holding him by the arm | |||
Number ten fresh nelly is waiting on you good night husband |
To showcase why neural vocoders are not ideal choices under some common metrics for generative tasks, here is a side-by-side comparison on speech enhancement of HiFi-GAN and the default signal-processing method (pseudo-inverse Mel-to-linear transform + phase from the noisy speech + iSTFT). For both sampled data and real data, the neural vocoder delivers audibly better speech quality, yet all three metrics considered are significantly worse.
PESQ / ESTOI / COVL | 443c020x | 440c0203 | 443c020t | 443c020m | 444o0308 | 441c0208
---|---|---|---|---|---|---
Sampled data | | | | | |
Mel Spectrogram (invMel+noisy phase+iSTFT) | 2.70 / 0.90 / 3.36 | | | | |
Mel Spectrogram (HiFi-GAN) | 2.29 / 0.81 / 2.96 | | | | |
Real data | | | | | |
Mel Spectrogram (invMel+noisy phase+iSTFT) | 3.68 / 0.96 / 4.46 | | | | |
Mel Spectrogram (HiFi-GAN) | 2.80 / 0.73 / 3.69 | | | | |
Waveform | 4.5 / 1.00 / 5.00 | | | | |