Abstract: In this paper, we present VATE, the Video-Audio-Text for affective Evaluation dataset. VATE collects a wide variety of multimodal data capturing a broad range of spontaneous human affective states. It contains 21,871 raw videos, together with voice recordings and text transcriptions, drawn from numerous emotion-evoking interviews. VATE is specifically designed for contrastive self-supervised representation learning of human affective states; it prioritises the quantity and quality of data over human labelling of emotions, which remains a highly subjective, often inconsistent, and controversial aspect of modern affective computing. To highlight the usefulness of our proposal, we release a multimodal encoder trained with a contrastive video-language-audio pre-training procedure on the VATE dataset. Experimental results show that this model exhibits markedly better few-shot generalization than fully supervised baselines on several downstream tasks. Data and code are available at: https://github.com/FrancescoAgnelli3/VATE.
External IDs: doi:10.1007/978-3-031-91575-8_13
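The abstract's exact pre-training objective is detailed in the paper and repository; as a rough illustration only, the sketch below shows a CLIP-style symmetric InfoNCE loss applied to each pair of modalities (video-audio, video-text, audio-text), which is a common choice for contrastive video-language-audio pre-training. The function names (`info_nce`, `trimodal_contrastive_loss`) and the averaging over modality pairs are assumptions for illustration, not the authors' exact formulation.

```python
# Illustrative sketch (not the authors' exact method): a symmetric InfoNCE-style
# contrastive objective over paired video, audio, and text embeddings.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings of shape [B, D]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def trimodal_contrastive_loss(video_emb, audio_emb, text_emb, temperature: float = 0.07):
    """Average the pairwise contrastive losses over the three modality pairs (assumed design)."""
    return (
        info_nce(video_emb, audio_emb, temperature)
        + info_nce(video_emb, text_emb, temperature)
        + info_nce(audio_emb, text_emb, temperature)
    ) / 3.0


if __name__ == "__main__":
    # Random embeddings standing in for encoder outputs on one mini-batch.
    B, D = 8, 256
    v, a, t = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(trimodal_contrastive_loss(v, a, t).item())
```

In such a setup, positives are the video, audio, and text segments coming from the same interview clip, while all other items in the mini-batch serve as negatives; the released encoder can then be evaluated in few-shot settings on downstream affective tasks.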