We used talking portrait videos from [AD-NeRF](https://github.com/YudongGuo/AD-NeRF), [GeneFace](https://github.com/yerfor/GeneFace), and the [HDTF dataset](https://github.com/MRzzm/HDTF).
These are static videos with an average length of about 3 to 5 minutes.

You can download an example video with the following command:

```
wget https://github.com/yerfor/GeneFace/releases/download/v1.1.0/May.zip
```

We also used [SynObama](https://grail.cs.washington.edu/projects/AudioToObama/) for inference in the cross-driven setting.