Sec2Sec Co-Attention Transformer for Video-Based Apparent Affective Prediction

Published: 01 Jan 2024, Last Modified: 05 Feb 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Video-based apparent affect detection plays a crucial role in video understanding, as it draws on vision, audio, audio-visual interactions, and spatiotemporal information, all of which are essential for accurate predictions. However, existing approaches often extract only a subset of these elements, limiting their models' predictive capacity. To address this limitation, we propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent affect in videos. We demonstrate that our proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods on two widely used datasets: LIRIS-ACCEDE and First Impressions. Notably, our model offers interpretability, allowing us to examine the contributions of different time points to the overall prediction. The implementation is available at: https://github.com/nestorsun/sec2sec.
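To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of an LSTM encoder per modality combined with Transformer-style co-attention (each modality attending to the other) and a pooled regression head. All layer sizes, feature dimensions, class names, and the fusion/pooling choices are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual model.

```python
# Hypothetical sketch of an LSTM + co-attention affect predictor.
# Dimensions, pooling, and the two-way cross-attention fusion are assumptions.
import torch
import torch.nn as nn


class Sec2SecCoAttentionSketch(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, hidden=256, heads=4, out_dim=2):
        super().__init__()
        # LSTMs encode the per-second visual and audio feature sequences.
        self.vis_lstm = nn.LSTM(vis_dim, hidden, batch_first=True)
        self.aud_lstm = nn.LSTM(aud_dim, hidden, batch_first=True)
        # Co-attention: each modality queries the other.
        self.vis_to_aud = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, out_dim)  # e.g. valence/arousal scores

    def forward(self, vis_seq, aud_seq):
        # vis_seq: (batch, seconds, vis_dim); aud_seq: (batch, seconds, aud_dim)
        v, _ = self.vis_lstm(vis_seq)
        a, _ = self.aud_lstm(aud_seq)
        # Cross-modal attention; the returned weights show which time points
        # contribute most, which is what supports interpretability claims.
        v_att, v_w = self.vis_to_aud(query=v, key=a, value=a)
        a_att, a_w = self.aud_to_vis(query=a, key=v, value=v)
        fused = torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.head(fused), (v_w, a_w)


# Example: one 8-second clip with pre-extracted per-second features.
model = Sec2SecCoAttentionSketch()
pred, attn_weights = model(torch.randn(1, 8, 512), torch.randn(1, 8, 128))
print(pred.shape)  # torch.Size([1, 2])
```

Inspecting the returned attention weights per second is one plausible way to realize the time-point interpretability mentioned in the abstract.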