To reviewers and AC: The media files are too large to include here. We will find a host and make our data publicly available.
 
In the appendix of our paper, we give a few examples from our raw data (with links to the external multi-modal input).