Uncertain Multimodal Intention and Emotion Understanding in the Wild
Abstract: Understanding intention and emotion from social media poses unique challenges due to the inherent uncertainty in multimodal data, where posts often contain incomplete or missing modalities. While this uncertainty reflects real-world scenarios, it remains underexplored within the computer vision community, particularly in conjunction with the intrinsic relationship between emotion and intention. To address these challenges, we introduce the Multimodal IntentioN and Emotion Understanding in the Wild (MINE) dataset, comprising over 20,000 topic-specific social media posts with natural modality variations across text, image, video, and audio. MINE is distinctively constructed to capture both the uncertain nature of multimodal data and the implicit correlations between intentions and emotions, providing extensive annotations for both aspects. To tackle these scenarios, we propose the Bridging Emotion-Intention via Implicit Label Reasoning (BEAR) framework. BEAR consists of two key components: a BEIFormer that leverages emotion-intention correlations, and a Modality Asynchronous Prompt that handles modality uncertainty. Experiments show that BEAR outperforms existing methods in processing uncertain multimodal data while effectively mining emotion-intention relationships for social media content understanding.
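The abstract describes the Modality Asynchronous Prompt only at a high level. As a point of reference, below is a minimal sketch of one common way such prompt-based handling of missing modalities is realized: learnable prompt tokens stand in for the features of absent modalities before fusion. This is not the authors' implementation; the class name ModalityPrompt, the PyTorch framing, and all dimensions are hypothetical.

```python
# Minimal sketch (assumptions, not the BEAR code): when a modality is
# missing from a post, its features are replaced by learnable prompt
# tokens before being passed to a fusion Transformer.
from typing import Dict, Optional

import torch
import torch.nn as nn

MODALITIES = ["text", "image", "video", "audio"]


class ModalityPrompt(nn.Module):
    def __init__(self, dim: int = 256, num_tokens: int = 4):
        super().__init__()
        # One learnable prompt per modality, used when that modality is absent.
        self.prompts = nn.ParameterDict({
            m: nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
            for m in MODALITIES
        })

    def forward(self, feats: Dict[str, Optional[torch.Tensor]]) -> torch.Tensor:
        # feats maps modality name -> (num_tokens, dim) features, or None if missing.
        tokens = []
        for m in MODALITIES:
            if feats.get(m) is None:
                tokens.append(self.prompts[m])  # substitute learnable prompt tokens
            else:
                tokens.append(feats[m])         # use the observed features
        # Concatenate into one token sequence for a downstream fusion model.
        return torch.cat(tokens, dim=0)


# Usage: a post where only text and image are available.
model = ModalityPrompt(dim=256, num_tokens=4)
fused_input = model({"text": torch.randn(4, 256),
                     "image": torch.randn(4, 256),
                     "video": None, "audio": None})
print(fused_input.shape)  # torch.Size([16, 256])
```

The design choice sketched here, keeping the fusion input shape fixed regardless of which modalities arrive, is one way a model can train and infer on posts with naturally varying modality availability.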