Abstract: Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals to understand human sentiment. Most existing works rely on superficial information and neglect contextual world knowledge (e.g., background information derived from, but beyond, the given image-text pairs), which restricts their performance. In this paper, we propose a plug-in framework named WisdoM that leverages contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM uses LVLMs to comprehensively analyze images and their corresponding texts, generating pertinent context. In addition, to reduce noise in the generated context, we design a training-free contextual fusion mechanism. We evaluate WisdoM on both aspect-level and sentence-level MSA tasks using the Twitter2015, Twitter2017, and MSED datasets. Experiments on these three MSA benchmarks with several advanced LVLMs show that our approach brings consistent and significant improvements (up to +6.3% F1 score).
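For illustration, the following is a minimal sketch of how such a plug-in pipeline could be wired together: an LVLM first generates contextual world knowledge for an image-text pair, and a training-free fusion step then combines sentiment probabilities obtained with and without that context. All function names (generate_context, predict_sentiment, fuse) and the convex-combination fusion rule are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a WisdoM-style plug-in pipeline (not the paper's code).
import numpy as np

def generate_context(image_path: str, text: str) -> str:
    """Stub for an LVLM call that produces background context for the pair."""
    return f"Background knowledge relevant to: {text}"

def predict_sentiment(text: str, image_path: str, context: str | None = None) -> np.ndarray:
    """Stub for a sentiment classifier returning probabilities over
    (negative, neutral, positive); replace with a real model."""
    rng = np.random.default_rng(abs(hash((text, context))) % (2**32))
    logits = rng.normal(size=3)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def fuse(p_plain: np.ndarray, p_context: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Training-free fusion: a simple convex combination of the two probability
    vectors (one possible instantiation; the paper's exact mechanism may differ)."""
    return alpha * p_context + (1.0 - alpha) * p_plain

if __name__ == "__main__":
    image, text = "example.jpg", "Great turnout at the rally today!"
    context = generate_context(image, text)
    p_plain = predict_sentiment(text, image)
    p_context = predict_sentiment(text, image, context)
    fused = fuse(p_plain, p_context)
    print("Predicted sentiment:", ["negative", "neutral", "positive"][int(np.argmax(fused))])
```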
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Engagement] Emotional and Social Signals, [Content] Multimodal Fusion
Relevance To Conference: WisdoM contributes to multimedia/multimodal processing by enhancing multimodal sentiment analysis (MSA) through the fusion of contextual world knowledge. Traditional MSA models have been limited by their reliance on superficial information from the text and image modalities, lacking deeper contextual understanding. WisdoM addresses this limitation by employing large vision-language models (LVLMs) to generate context-rich world knowledge from the joint analysis of images and texts. This enables a more nuanced understanding of sentiment, capturing underlying themes that lie beyond the surface level of the data.
Supplementary Material: zip
Submission Number: 3780