Abstract: In this paper, we present VATE, the Video-Audio-Text for affective Evaluation dataset. VATE collects a wide variety of multimodal data capturing a broad range of spontaneous human affective states. It contains 21,871 raw videos, together with voice recordings and text transcriptions, drawn from numerous emotion-evoking interviews. VATE is specifically designed for contrastive self-supervised representation learning of human affective states; it prioritises the quantity and quality of data over human labelling of emotions, which remains a highly subjective, often inconsistent, and controversial aspect of modern affective computing. To highlight the usefulness of our proposal, we release a multimodal encoder trained with a contrastive video-language-audio pre-training procedure on the VATE dataset. Experimental results show that this model exhibits markedly better few-shot generalization than fully supervised baselines on several downstream tasks. Data and code are available at: https://github.com/FrancescoAgnelli3/VATE.
External IDs: doi:10.1007/978-3-031-91575-8_13
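The abstract's exact pre-training objective is detailed in the paper and repository; as a rough illustration only, the sketch below shows a CLIP-style symmetric InfoNCE loss applied to each pair of modalities (video-audio, video-text, audio-text), which is a common choice for contrastive video-language-audio pre-training. The function names (`info_nce`, `trimodal_contrastive_loss`) and the averaging over modality pairs are assumptions for illustration, not the authors' exact formulation.

```python
# Illustrative sketch (not the authors' exact method): a symmetric InfoNCE-style
# contrastive objective over paired video, audio, and text embeddings.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings of shape [B, D]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def trimodal_contrastive_loss(video_emb, audio_emb, text_emb, temperature: float = 0.07):
    """Average the pairwise contrastive losses over the three modality pairs (assumed design)."""
    return (
        info_nce(video_emb, audio_emb, temperature)
        + info_nce(video_emb, text_emb, temperature)
        + info_nce(audio_emb, text_emb, temperature)
    ) / 3.0


if __name__ == "__main__":
    # Random embeddings standing in for encoder outputs on one mini-batch.
    B, D = 8, 256
    v, a, t = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(trimodal_contrastive_loss(v, a, t).item())
```

In such a setup, positives are the video, audio, and text segments coming from the same interview clip, while all other items in the mini-batch serve as negatives; the released encoder can then be evaluated in few-shot settings on downstream affective tasks.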