Spatiotemporal Modeling of Bodily Emotional Expressions for Continuous Valence-Arousal-Dominance Prediction in Video

AAAI 2026 Workshop BEEU Submission 12 Authors

Published: 18 Nov 2025, Last Modified: 18 Nov 2025 · BEEU 2026 · CC BY 4.0
Keywords: Valence-Arousal-Dominance, Bodily-Expressed Emotion, Affective Computing, Video Emotion Recognition, Spatiotemporal Modeling
Abstract: Predicting Valence-Arousal-Dominance (VAD) dimensions from bodily-expressed emotions in videos remains a fundamentally challenging task in affective computing, requiring models that capture subtle spatiotemporal patterns while balancing computational efficiency and interpretability. We present a comprehensive investigation of VAD prediction approaches on the newly introduced Annotated Bodily Expressed Emotion (ABEE) dataset, which contains approximately 3,200 video clips spanning 8 primary emotion categories and 20 subcategories. We explore two complementary methodologies: a feature-based gradient boosting approach using XGBoost with carefully engineered spatiotemporal features and dimensionality reduction, and deep learning architectures that learn hierarchical representations directly from raw video data. The feature-based approach is highly efficient, with sub-second training times and minimal resource requirements, while the deep models underscore the fundamental difficulty of capturing continuous VAD dimensions from bodily expressions. Through systematic evaluation on the ABEE dataset, we establish baseline performance for the VAD prediction task, with our gradient boosting approach achieving $R^2$ scores of -0.090, -0.014, and -0.058 for valence, arousal, and dominance, respectively. These results highlight the substantial gap between the capabilities of current methods and the inherent complexity of bodily emotion signals, providing benchmarks for future research. We further discuss critical insights regarding feature engineering, temporal dynamics, and the intrinsic challenges of continuous emotion prediction from naturalistic video data, emphasizing the need for spatiotemporal modeling strategies tailored specifically to bodily expressions.
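The abstract outlines a feature-based pipeline of engineered spatiotemporal features, dimensionality reduction, and XGBoost regression evaluated with per-dimension $R^2$. The sketch below illustrates how such a pipeline could be wired together; it is not the authors' implementation, and the feature dimensions, PCA component count, and XGBoost hyperparameters are placeholder assumptions.

```python
# Illustrative sketch of a feature-based VAD regression pipeline: PCA for
# dimensionality reduction followed by one gradient-boosted regressor per
# VAD dimension, scored with per-dimension R^2. Not the paper's exact setup.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# Placeholder data: one row of pre-extracted spatiotemporal features per clip
# and continuous [valence, arousal, dominance] targets. Shapes are hypothetical.
rng = np.random.default_rng(0)
X = rng.normal(size=(3200, 512))        # e.g., pooled pose/motion statistics per clip
y = rng.uniform(-1.0, 1.0, size=(3200, 3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Reduce feature dimensionality before boosting (component count is illustrative).
pca = PCA(n_components=64).fit(X_train)
X_train_p, X_test_p = pca.transform(X_train), pca.transform(X_test)

# Fit one XGBoost regressor per target dimension.
model = MultiOutputRegressor(
    XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
)
model.fit(X_train_p, y_train)

# Per-dimension R^2, reported separately for valence, arousal, and dominance.
r2 = r2_score(y_test, model.predict(X_test_p), multioutput="raw_values")
print(dict(zip(["valence", "arousal", "dominance"], r2)))
```

With random placeholder features, the printed $R^2$ values hover near or below zero, which mirrors how negative scores arise when predictions explain less variance than the target mean.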
Submission Number: 12