StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models
Abstract: Predicting and reasoning about how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models
(MLLMs) have shown impressive video understanding capabilities, they tend to
focus more on the semantic content of videos, often overlooking emotional stimuli.
Hence, most existing MLLMs fall short in estimating viewers’ emotional reactions
and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning
(VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness
involves sampling video frames with events that are most likely to evoke viewers’
emotions. Token-level awareness performs tube selection in the token space to
make the MLLM concentrate on emotion-triggered spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering
MLLMs’ reasoning strengths towards emotional focus and thereby enhancing
their affective reasoning ability. To thoroughly assess the effectiveness of VAR,
we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments
demonstrate its superiority in understanding viewers’ emotional responses to
videos and providing coherent and insightful explanations. Our code is available
at https://github.com/EthanG97/StimuVAR.
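
To make the two-level mechanism described above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that "events" can be approximated by large changes between consecutive frame features, and that a "tube" is a fixed patch position tracked across the sampled frames. The function names `select_event_frames` and `select_tubes`, the L2-distance scoring, and the top-k selection are all hypothetical choices made only for illustration; the repository linked above is the reference for the actual method.

```python
from typing import List, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_event_frames(frame_feats: List[Vector], k: int) -> List[int]:
    """Frame-level awareness (assumed proxy): keep the k frames whose
    features change most relative to the previous frame."""
    # Frame 0 has no predecessor, so it is given a score of 0.
    scores = [0.0] + [
        l2_distance(frame_feats[i], frame_feats[i - 1])
        for i in range(1, len(frame_feats))
    ]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # return indices in temporal order


def select_tubes(tokens: List[List[Vector]], k: int) -> List[int]:
    """Token-level awareness (assumed proxy): tokens[t][p] is the token of
    patch p in frame t. Keep the k patch positions (tubes) whose tokens
    vary most over time."""
    num_patches = len(tokens[0])
    scores = [
        sum(
            l2_distance(tokens[t][p], tokens[t - 1][p])
            for t in range(1, len(tokens))
        )
        for p in range(num_patches)
    ]
    top = sorted(range(num_patches), key=lambda p: scores[p], reverse=True)[:k]
    return sorted(top)  # return patch indices in spatial order


# Toy usage: five frames with 2-D features, then three frames of three patches.
frames = [[0, 0], [0, 0], [5, 0], [5, 0], [5, 9]]
event_idx = select_event_frames(frames, k=2)

patch_tokens = [
    [[0], [0], [0]],
    [[0], [1], [0]],
    [[0], [3], [1]],
]
tube_idx = select_tubes(patch_tokens, k=2)
```

In this sketch, only the tokens in the selected tubes of the selected frames would be passed on to the MLLM, which is one plausible way to concentrate its input on regions that change sharply. A learned or emotion-specific scoring function could replace the simple distance used here.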