The supplementary material primarily contains video results.
These include comparisons with baseline methods under a fixed head pose: comparison without pose.
Comparisons with baseline methods under a dynamic head pose: comparison with pose.
Results for varying the text caption while keeping the same audio input: comparison with different style.
Example videos from the user studies: some userstudy videos.
Implementation code for some key components: core code.