Abstract: Understanding social context and affect plays a key role in our day-to-day interaction and enables us to better understand the situation, our own self, and others. Classifying a social event and the associated perceived affect in a visual scene involving humans is a step towards providing better scene understanding capability to socially interactive artificial intelligence agents. In this work, we perform Multi-Task Learning (MTL) to jointly learn and predict both social event and affect in group videos and explore whether perceived group affect has an influence on social event classification. Transformer-based models in computer vision have gained a lot of attention in video classification tasks. We propose spatio-temporal transformer-based models for our tasks and evaluate their performance on the Video Group AFfect (VGAF) dataset, which consists of unconstrained YouTube video clips of social events and incorporates perceived affect information. Our contributions are as follows. We introduce ten social event category labels for the clips in the VGAF dataset from existing keywords. We propose a spatio-temporal transformer model and its variations for joint learning and prediction, and show improvements to social event prediction by utilizing affect information. Our proposed models improve the state of the art for group affect prediction on the VGAF dataset. Code is available at https://github.com/aarti9/VideoSocialContext
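The joint-learning setup described above can be illustrated with a minimal sketch: a shared spatio-temporal transformer encoder over per-frame features with two classification heads, one for the social event labels and one for group affect, trained with a summed cross-entropy loss. This is an assumed, simplified illustration of multi-task learning for video classification, not the authors' exact architecture; the feature dimension, number of frames, and the three affect classes are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative multi-task sketch (not the paper's exact model):
# a shared transformer encoder over per-frame features, with two heads
# for social event (10 classes, per the abstract) and group affect
# (3 classes assumed here).

class MultiTaskVideoTransformer(nn.Module):
    def __init__(self, feat_dim=512, n_frames=16, n_events=10, n_affect=3,
                 n_heads=8, n_layers=4):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, n_frames, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.event_head = nn.Linear(feat_dim, n_events)   # social event logits
        self.affect_head = nn.Linear(feat_dim, n_affect)  # group affect logits

    def forward(self, frame_feats):           # (B, T, feat_dim) per-frame features
        x = self.encoder(frame_feats + self.pos)
        x = x.mean(dim=1)                     # temporal average pooling
        return self.event_head(x), self.affect_head(x)

# Joint training step: sum the two task losses.
model = MultiTaskVideoTransformer()
feats = torch.randn(2, 16, 512)               # dummy clip features
event_logits, affect_logits = model(feats)
loss = (nn.functional.cross_entropy(event_logits, torch.tensor([0, 3])) +
        nn.functional.cross_entropy(affect_logits, torch.tensor([1, 2])))
loss.backward()
```

Sharing the encoder lets the affect signal influence the representation used for social event prediction, which is the interaction the abstract sets out to study.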