A Sentimental Prompt Framework with Visual Text Encoder for Multimodal Sentiment Analysis

Shizhou Huang, Bo Xu, Changqun Li, Jiabo Ye, Xin Lin

Published: 01 Jan 2024, Last Modified: 18 May 2025, ICMR 2024, CC BY-SA 4.0

Abstract: Multimodal sentiment analysis of social media posts has recently received increasing attention, because it can improve on single-modality sentiment analysis by leveraging the complementary information in text and images. Despite their success, current methods still suffer from two weaknesses: (1) the image representations they use do not capture sentiment information, which leaves a significant gap between the image representations and the sentiment results; (2) they ignore the sentiments expressed by symbols (emoticons, emojis) in the text, even though these symbols can effectively reflect the user's sentiments. To address these issues, we propose a sentimental prompt framework with visual text encoder (SPFVTE). For the first problem, instead of using the image representation directly, we project it into a prompt and use prompt learning to capture the sentiment information in images by learning a sentiment-specific prompt. For the second problem, considering that people infer the meanings of emojis and emoticons from their graphical appearance, we render the text as an image and use a visual text encoder to capture the sentiments contained in emojis and emoticons. Experiments on three public multimodal sentiment datasets show that our method significantly and consistently outperforms state-of-the-art methods. The datasets and source code are available at https://github.com/JinFish/SPFVTE.
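
The idea of projecting an image representation into a prompt can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above); it only assumes that one image feature vector is mapped by a linear projection to a few prompt token embeddings, which are then prepended to the text token embeddings. The function names, dimensions, and the random weights that stand in for learned parameters are all hypothetical.

```python
import random


def make_projection(d_img, d_txt, num_prompt_tokens, seed=0):
    """Random weights standing in for a learned linear layer (assumption)."""
    rng = random.Random(seed)
    rows = num_prompt_tokens * d_txt
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_img)] for _ in range(rows)]


def image_to_prompt(image_feature, weights, d_txt):
    """Project one image feature vector into a list of prompt token embeddings."""
    # Matrix-vector product: one output value per weight row.
    flat = [sum(w * x for w, x in zip(row, image_feature)) for row in weights]
    # Split the flat output into chunks of size d_txt, one chunk per prompt token.
    return [flat[i:i + d_txt] for i in range(0, len(flat), d_txt)]


def prepend_prompt(prompt_tokens, text_embeddings):
    """Place the image-derived prompt tokens in front of the text tokens."""
    return prompt_tokens + text_embeddings


# Toy sizes chosen only for illustration.
d_img, d_txt, k = 8, 4, 2
image_feature = [0.5] * d_img                         # stand-in image feature
text_embeddings = [[0.0] * d_txt for _ in range(3)]   # stand-in text tokens

weights = make_projection(d_img, d_txt, k)
prompt = image_to_prompt(image_feature, weights, d_txt)
sequence = prepend_prompt(prompt, text_embeddings)
print(len(prompt), len(sequence))
```

In a trained model the projection weights would be learned together with the sentiment objective, so that the prompt tokens carry sentiment-specific information; the sketch only shows the shape of the data flow.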