We attach examples for further evaluation. The index.html file can be opened in Google Chrome, where it renders as a static website.

The files in assets/TTA are numbered from 1 to 17, and the captions for TTA in assets/TTA/caption.csv follow the same order.

We additionally provide long audio examples, with captions in assets/TTA/caption_20s.csv.
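
To look up the caption for a given TTA file number, something like the following sketch may help. It assumes the caption CSV has no header row and that row i corresponds to file i; neither is confirmed here, so adjust it after inspecting the file. The helper names load_captions and number_rows are hypothetical.

```python
import csv
from pathlib import Path


def load_captions(csv_path):
    """Read a caption CSV and return its non-empty rows as lists of strings."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f) if row]


def number_rows(rows, start=1):
    """Pair each row with a running index.

    Assumption: the rows are in the same order as the numbered files,
    so index i refers to file i.
    """
    return [(i, row) for i, row in enumerate(rows, start=start)]


if __name__ == "__main__":
    path = Path("assets/TTA/caption.csv")
    if path.exists():
        for idx, row in number_rows(load_captions(path)):
            print(idx, row)
```

If the file turns out to have a header row, drop the first row before numbering. The same functions can be pointed at assets/TTA/caption_20s.csv.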

The files in assets/VTA are numbered from 1 to 6 in the same order. Each generated audio result has been combined with its muted input video for clearer evaluation.