<html >
  <head >
    <title >SpeechFlow</title>
    <style>
      thead {color:blue;}
      table {width:1300px;}
      table, th, td {border: 1px solid black; text-align:center;}
      div.abs {left: 0px; max-width: 800px; min-width: 600px; padding: 0px;}
      .max-length {
        max-width: 1300px; /* Adjust this value as needed */
        white-space: pre-wrap; /* Wrap text and preserve line breaks within the text */
      }
    </style>
  </head>
  <body >
    <h1 >Generative Pre-training for Speech with Flow Matching</br></h1>

    <p class="max-length"><b>Abstract:</b><div class="abs">Generative models have gained increasing attention in recent years for their remarkable success in tasks that require estimating and sampling from a data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundational model for generation tasks in speech can be built with generative pre-training.</br></br></p>

    We recommend loading the demo pages with Chrome since Safari sometimes freezes during loading.</br></br>

    <h2><a id="main_eval">Speech Enhancement</a></h2>

    <p class="max-length"> We rank samples in the WSJ0CHiME3 test set from easy to hard using PESQ of noisy speech, and show the sample at the 0/20/40/60/80/100th percentile rank.</p>
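    <p class="max-length">The selection described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the script used to build this page: the function name, the nearest-index percentile mapping, the reading of "easy" as "higher PESQ of the noisy input", and the example scores are all hypothetical.</p>

```python
def pick_percentile_samples(pesq_by_id, percentiles=(0, 20, 40, 60, 80, 100)):
    """Return one sample ID per percentile rank, ordered from easy to hard."""
    # Assumption: a higher PESQ of the noisy input means an easier sample,
    # so the ranking is sorted in descending order of PESQ.
    ranked = sorted(pesq_by_id, key=pesq_by_id.get, reverse=True)
    last = len(ranked) - 1
    # Map each percentile to the nearest index in the ranked list.
    return [ranked[round(p / 100 * last)] for p in percentiles]


# Hypothetical PESQ scores for illustration only (not values from the test set).
scores = {
    "utt_a": 1.2, "utt_b": 2.9, "utt_c": 1.8,
    "utt_d": 2.3, "utt_e": 1.5, "utt_f": 3.4,
}
picked = pick_percentile_samples(scores)
print(picked)
```

    <p class="max-length">With a full test set, the 0th and 100th percentile entries correspond to the easiest and hardest noisy utterances under this ranking.</p>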

    <table >
      <thead >
        <tr >
          <th style="width:200px" rowspan="4">  </th>
          <th style="width:150px" colspan="6"> Percentile rank (easy to hard) </th>
        </tr>
        <tr >
          <th style="width:150px" rowspan="1"> 0% </th>
          <th style="width:150px" colspan="1"> 20% </th>
          <th style="width:150px" colspan="1"> 40% </th>
          <th style="width:150px" colspan="1"> 60% </th>
          <th style="width:150px" colspan="1"> 80% </th>
          <th style="width:150px" colspan="1"> 100% </th>
        </tr>
        <tr >
          <th style="width:150px" colspan="6"> Sample ID </th>
        </tr>
        <tr >
          <th style="width:150px" rowspan="1">443c020x </th>
          <th style="width:150px" colspan="1">440c0203 </th>
          <th style="width:150px" colspan="1">443c020t </th>
          <th style="width:150px" colspan="1">443c020m </th>
          <th style="width:150px" colspan="1">444o0308 </th>
          <th style="width:150px" colspan="1">441c0208 </th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td >Noisy speech</td>
          <td><audio controls style="width:150px;"><source src="wavs/se/ns/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/ns/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/ns/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/ns/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/ns/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/ns/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td colspan="7" style="text-align: left; padding-left: 10px; font-size: 16px;">Models trained on Voicebank-Demand</td>
        </tr>
        <tr>
          <td > <a href="https://huggingface.co/speechbrain/metricgan-plus-voicebank" target="_blank" rel="noopener noreferrer">MetricGAN+</a> </br> (Fu et al., 2021)</td>
          <td><audio controls style="width:150px;"><source src="wavs/se/mg/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/mg/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/mg/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/mg/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/mg/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/mg/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td > <a href="https://github.com/sp-uhh/sgmse" target="_blank" rel="noopener noreferrer">SGMSE+</a> </br> (Richter et al., 2023)</td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sm/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sm/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sm/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sm/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sm/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sm/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td > SpeechFlow </br> (HiFi-GAN, for demo only) </td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td> SpeechFlow </br> (invMel+noisy phase+iSTFT, as in paper)</td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sf/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sf/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sf/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sf/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sf/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sf/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td colspan="7" style="text-align: left; padding-left: 10px; font-size: 16px;">Models trained on DNS2020</td>
        </tr>
        <tr>
          <td > <a href="https://github.com/facebookresearch/denoiser" target="_blank" rel="noopener noreferrer">DEMUCS</a> </br> (D&eacute;fossez et al., 2020)</td>
          <td><audio controls style="width:150px;"><source src="wavs/se/demucs/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/demucs/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/demucs/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/demucs/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/demucs/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/demucs/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td > SpeechFlow </br> (HiFi-GAN, for demo only) </td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfdns/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfdns/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfdns/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfdns/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfdns/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/sfdns/441c0208.wav" type="audio/wav"></audio></td>
        </tr>
        <tr>
          <td colspan="7" style="text-align: left; padding-left: 10px; font-size: 16px;">Ground truth</td>
        </tr>
        <tr>
          <td > Waveform </td>
          <td><audio controls style="width:150px;"><source src="wavs/se/gt/443c020x.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/gt/440c0203.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/gt/443c020t.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/gt/443c020m.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/gt/444o0308.wav" type="audio/wav"></audio></td>
          <td><audio controls style="width:150px;"><source src="wavs/se/gt/441c0208.wav" type="audio/wav"></audio></td>
        </tr>

      </tbody>
    </table>

  </br></br>
  </br>


  </div>
  <h2><a id="main_eval">Speech Separation</a></h2>

  <p class="max-length"> Samples are from an internal dataset, and all speakers are unseen by the models. An interesting observation is that while the background noise may sound different from the reference recording, it sounds coherent throughout our prediction. This is expected, since the model cannot discern which noise belongs to which speaker. The better coherence also suggests that our model learns the structure of audio better than the other models.
  </p>


  <table >
    <thead >
      <tr >
        <th style="width:300px" colspan="2">  </th>
        <th style="width:200px" rowspan="1">Sample #1 </th>
        <th style="width:200px" colspan="1">Sample #2 </th>
        <th style="width:200px" colspan="1">Sample #3 </th>
        <th style="width:200px" colspan="1">Sample #4 </th>
        <th style="width:200px" colspan="1">Sample #5 </th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td colspan="2">Mixture</td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/mix/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/mix/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/mix/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/mix/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/mix/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td rowspan="2"> <a href="https://huggingface.co/JorisCos/ConvTasNet_Libri2Mix_sepclean_16k" target="_blank" rel="noopener noreferrer">ConvTasNet</a> </br> (Luo &amp; Mesgarani, 2019)</td>
        <td> speaker 1 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s2/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s1/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s2/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s1/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s2/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td> speaker 2 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s1/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s2/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s1/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s2/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/cv/s1/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td rowspan="2"> <a href="https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriMix/separation" target="_blank" rel="noopener noreferrer">SepFormer</a></br> (Subakan et al., 2021) </td>
        <td> speaker 1 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s1/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s1/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s2/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s1/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s1/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td> speaker 2 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s2/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s2/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s1/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s2/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sp/s2/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td rowspan="2"> SpeechFlow </td>
        <td> speaker 1 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s2/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s2/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s1/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s1/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s1/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td> speaker 2 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s1/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s1/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s2/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s2/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/sf/s2/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td rowspan="2"> Ground truth </td>
        <td> speaker 1 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s1/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s1/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s1/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s1/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s1/5.wav" type="audio/wav"></audio></td>
      </tr>
      <tr>
        <td> speaker 2 </td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s2/1.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s2/2.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s2/3.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s2/4.wav" type="audio/wav"></audio></td>
        <td><audio controls style="width:200px;"><source src="wavs/ss/gt/s2/5.wav" type="audio/wav"></audio></td>
      </tr>

    </tbody>
  </table>




<h2><a id="main_eval">Zero-shot Text-to-Speech Synthesis</a></h2>

<p class="max-length"> Reference speakers are from an internal dataset, and all speakers are unseen by the models. </p>

<table >
  <thead >
    <tr >
      <th style="width:700px" rowspan="2"> Text </th>
      <th style="width:200px" rowspan="2">Prompt </th>
      <th style="width:200px" colspan="1">Voicebox </th>
      <th style="width:200px" colspan="1">SpeechFlow </th>
    </tr>
    <tr >
      <th style="width:200px" colspan="1"> 60k hours labeled data </th>
      <th style="width:200px" colspan="1"> 960 hours labeled data </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/5639-40744-0020.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/5639-40744-0020.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/5639-40744-0020.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/61-70970-0024.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/61-70970-0024.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/61-70970-0024.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        And lay me down in thy cold bed and leave my shining lot
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/908-157963-0027.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/908-157963-0027.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/908-157963-0027.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        And the whole night the tree stood still and in deep thought
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/672-122797-0040.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/672-122797-0040.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/672-122797-0040.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/1284-1180-0002.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/1284-1180-0002.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/1284-1180-0002.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        The army found the people in poverty and left them in comparative wealth
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/4077-13754-0000.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/4077-13754-0000.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/4077-13754-0000.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/1221-135767-0014.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/1221-135767-0014.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/1221-135767-0014.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        He was in deep converse with the clerk and entered the hall holding him by the arm
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/61-70970-0007.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/61-70970-0007.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/61-70970-0007.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td style="text-align: left;vertical-align:middle; ">
      <font size="2">
        Number ten fresh nelly is waiting on you good night husband
      </font>
      </td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/prompt/1089-134686-0004.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/vb/1089-134686-0004.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:200px;"><source src="wavs/zstts/sf/1089-134686-0004.wav" type="audio/wav"></audio></td>
    </tr>
  </tbody>
  </table>
</br>
</br>

<h2><a id="main_eval">Additional Discussion on Mel-to-waveform</a></h2>

<p class="max-length">To show that neural vocoders are not ideal for some common metrics of generative tasks, here is a side-by-side comparison of HiFi-GAN and the default signal processing method (pseudo-inverse Mel-to-linear transform + phase information from noisy speech + iSTFT) on speech enhancement. For both sampled data and real data, the neural vocoder delivers audibly better speech quality, yet all three metrics considered (PESQ, ESTOI, and COVL) are significantly worse.
</p>
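<p class="max-length">As a rough sketch of the signal processing method above: let M be the Mel filterbank matrix, M<sup>+</sup> its pseudo-inverse, A the Mel-scale magnitude spectrogram (predicted by the model, or computed from clean speech for the real-data rows), and Y the STFT of the noisy speech. The symbols are our own notation, and the clipping at zero and any undoing of log compression before this step are assumptions rather than details stated on this page. The waveform is then recovered roughly as:</p>

```latex
\hat{S} = \max\left(M^{+} A,\ 0\right), \qquad
\hat{x} = \mathrm{iSTFT}\left(\hat{S} \odot e^{\,j \angle Y}\right)
```

<p class="max-length">Here the phase is copied from the noisy input rather than generated, whereas a neural vocoder such as HiFi-GAN synthesizes the waveform, and hence its phase, from the Mel spectrogram alone.</p>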
<table >
  <thead >
    <tr >
      <th style="width:200px" rowspan="1">  </th>
      <th style="width:200px" rowspan="1">PESQ / ESTOI / COVL</th>
      <th style="width:150px" rowspan="1">443c020x</th>
      <th style="width:150px" colspan="1">440c0203</th>
      <th style="width:150px" colspan="1">443c020t</th>
      <th style="width:150px" colspan="1">443c020m</th>
      <th style="width:150px" colspan="1">444o0308</th>
      <th style="width:150px" colspan="1">441c0208</th>
    </tr>

  </thead>
  <tbody>
    <tr>
      <td colspan="8" style="text-align: left; padding-left: 10px; font-size: 16px;"> Sampled data</td>
    </tr>
    <tr>
      <td> Mel Spectrogram </br> (invMel+noisy phase+iSTFT)</td>
      <td>2.70 / 0.90 / 3.36 </td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sf/443c020x.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sf/440c0203.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sf/443c020t.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sf/443c020m.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sf/444o0308.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sf/441c0208.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td > Mel Spectrogram </br> (HiFi-GAN)</td>
      <td>2.29 / 0.81 / 2.96 </td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/443c020x.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/440c0203.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/443c020t.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/443c020m.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/444o0308.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/sfhf/441c0208.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td colspan="8" style="text-align: left; padding-left: 10px; font-size: 16px;">Real data </td>
    </tr>
    <tr>
      <td > Mel Spectrogram </br> (invMel+noisy phase+iSTFT)</td>
      <td> 3.68 / 0.96 / 4.46 </td>
      <td><audio controls style="width:150px;"><source src="wavs/se/istft/443c020x.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/istft/440c0203.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/istft/443c020t.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/istft/443c020m.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/istft/444o0308.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/istft/441c0208.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td > Mel Spectrogram </br> (HiFi-GAN)</td>
      <td> 2.80 / 0.73 / 3.69 </td>
      <td><audio controls style="width:150px;"><source src="wavs/se/hifi/443c020x.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/hifi/440c0203.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/hifi/443c020t.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/hifi/443c020m.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/hifi/444o0308.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/hifi/441c0208.wav" type="audio/wav"></audio></td>
    </tr>
    <tr>
      <td > Waveform </td>
      <td> 4.5 / 1.00 / 5.00 </td>
      <td><audio controls style="width:150px;"><source src="wavs/se/gt/443c020x.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/gt/440c0203.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/gt/443c020t.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/gt/443c020m.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/gt/444o0308.wav" type="audio/wav"></audio></td>
      <td><audio controls style="width:150px;"><source src="wavs/se/gt/441c0208.wav" type="audio/wav"></audio></td>
    </tr>

  </tbody>
</table>
</br>
</br>

  </body>
</html>
