Keywords: Multimodal Sentiment Analysis, Prompt Learning, Vision-Language Models, Contrastive Learning
Abstract: Multimodal sentiment analysis aims to infer affective states from image–text pairs in social media. Most existing approaches rely on single-step fusion or static representations, treating affective cues as fixed rather than progressively refined. Meanwhile, prompt-based methods typically initialize prompts with sentiment-irrelevant text or random vectors, or inject auxiliary semantics in a single step, and thus fail to explicitly guide semantic evolution. To address these limitations, we propose a semantic-guided progressive framework with stage-wise prompt interaction (SPRO), which organizes multimodal supervision along a cognitively inspired trajectory from Tone to Emotion. Specifically, emotion understanding is decomposed into three successive stages (Tone, Content, and Emotion), corresponding to perceptual appraisal, semantic grounding, and affective reasoning. At each stage, LLM-generated structured captions provide explicit semantic guidance, while learnable multimodal prompts serve as a shared affective interface that progressively aligns visual and textual representations within a unified semantic space. Furthermore, a dual-path contrastive alignment strategy jointly optimizes image–category and text–category consistency, reinforcing cross-modal semantic agreement. Experiments demonstrate that SPRO achieves superior accuracy and interpretability compared with state-of-the-art methods. The source code is publicly available.
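The dual-path contrastive alignment mentioned above can be illustrated with a minimal sketch, assuming CLIP-style encoders and one learnable category (prompt) embedding per sentiment class. All names here (image_feats, text_feats, category_feats, tau) are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def dual_path_contrastive_loss(image_feats, text_feats, category_feats,
                               labels, tau=0.07):
    """Jointly align image->category and text->category similarities.

    image_feats:    (B, D) image embeddings
    text_feats:     (B, D) text embeddings
    category_feats: (C, D) learnable per-category prompt embeddings
    labels:         (B,)   ground-truth sentiment category indices
    """
    # L2-normalize so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    category_feats = F.normalize(category_feats, dim=-1)

    # Temperature-scaled similarity logits for each path.
    img_logits = image_feats @ category_feats.t() / tau   # (B, C)
    txt_logits = text_feats @ category_feats.t() / tau    # (B, C)

    # Each path is a cross-entropy over categories; summing the two pulls
    # both modalities toward the same category anchor in embedding space.
    return F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)
```

Summing the two cross-entropy terms is one plausible way to enforce image–category and text–category consistency jointly; the paper may weight or schedule the two paths differently.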
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Sentiment Analysis, Multimodality, Image–Text Matching
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 3228