Emotion-JEPA: Predictive Visual Adaptation and Audio-Modulated Fusion for Multimodal Emotion Recognition
Abstract: Multimodal emotion recognition (MER) requires combining visual, acoustic, and textual cues from short, noisy, and often ambiguous emotional expressions. While large pretrained multimodal models provide strong general-purpose representations, their direct use for MER can be limited by a mismatch between generic pretraining data and fine-grained affective behavior, as well as by fusion mechanisms that do not explicitly account for modality reliability. We study a two-stage framework for MER that isolates two factors: affective visual representation adaptation and reliability-aware multimodal fusion. In the first stage, we adapt a visual encoder to the emotion domain using predictive self-supervised learning on unlabeled emotion videos, without using pseudo-labels or additional manual annotations. In the second stage, we train a supervised multimodal classifier with Audio-Modulated Hybrid Fusion (AMHF), where audio cues guide cross-modal interaction through audio spectral gating, adaptive cross-modal routing, temporal memory, uncertainty estimation, and progressive fusion. On the MER2024-SEMI benchmark, visual predictive adaptation improves performance by $+7.92$ weighted-average F1 (WAF) over the same model without domain adaptation. Under matched encoders, parameter budgets, and training protocols, AMHF improves performance by $+7.25$ WAF over a capacity-matched cross-attention fusion baseline. Component-level ablations further show that each AMHF stage contributes to the final performance. These results suggest that, for MER under limited supervision, domain-aligned representation learning and reliability-aware fusion can be as important as increasing model scale.
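For readers unfamiliar with the metric, the sketch below shows how a weighted-average F1 score is commonly computed: the per-class F1 scores are averaged with weights proportional to each class's support in the ground truth. It assumes the abstract's WAF follows this standard support-weighted definition; the function name `weighted_average_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_average_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed WAF definition)."""
    support = Counter(y_true)  # number of ground-truth samples per class
    total = len(y_true)
    waf = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); denom is nonzero whenever the class has support
        f1 = 2 * tp / denom if denom else 0.0
        waf += (n / total) * f1  # weight each class by its share of the samples
    return waf


# Toy example with hypothetical emotion labels.
y_true = ["happy", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "angry"]
print(weighted_average_f1(y_true, y_pred))
```

Because each class is weighted by its support, frequent emotion classes influence the score more than rare ones, which is worth keeping in mind when comparing gains on imbalanced emotion datasets.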
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yuhang_Zang1
Submission Number: 8991