We provide the following audio synthesis samples. They present synthesis results for V2S, as well as the drawbacks of existing V2S vocoders discussed in the paper.

1. **Reference**
    - Directory: GT
    - Contains: Ground-truth videos and text transcriptions from the original dataset.

1. **V2S synthesis results**
    - Directory: V2S synthesis results
    - Contains: Synthesis results of the V2S methods on LRS3-TED and LRS2-BBC, as used in the intelligibility evaluation.

1. **Zero-shot and Finetuned Vocoder performance on dataset**
    - Directory: Dataset adaptation (LRS2-BBC)
    - Contains: Zero-shot and finetuned performance of unit-HiFiGAN on the LRS2-BBC dataset. Unit-HiFiGAN sounds worse when finetuned on LRS2-BBC.

1. **V2S Finetuning**
    - Directory: V2S Finetuning (HiFiGAN)
    - Contains: V2S finetuning results for ReVISE (Mel) with the vocoder finetuned on LRS3-TED. There is a large gap in HiFiGAN's output before and after it is finetuned on the output of the V2S frontend encoder.

