Abstract: Multimodal sentiment analysis (MSA) detects human sentiments by understanding data from multiple modalities, such as text and images. Existing research primarily strives for an effective multimodal fusion framework to derive informative representations. However, these methods neglect the necessity of exploiting external knowledge to aid in analyzing sentiments. As a result, the lack of external commonsense hampers these models when opinion cues are implicit and obscure. To address this limitation, in this paper, we propose an Auxiliary Rationale Knowledge enhanced framework, namely ARK, which improves MSA models via learning from a multimodal large language model (MLLM). Specifically, based on text-image pairs, we employ Chain-of-Thought prompting to generate image descriptions and rationales from the MLLM as auxiliary knowledge, thus enriching the original samples with commonsense knowledge encoded within the MLLM. By combining the source text with image descriptions, we are able to effectively handle MSA through a Text+Text paradigm. In this paradigm, smaller pre-trained language models (LMs) can be tasked with sentiment classification via prompt-tuning. Besides, rationales are leveraged as additional supervision to help LMs learn reasoning abilities. Experimental results demonstrate that our proposed method outperforms current state-of-the-art approaches across four datasets. Our data and code are available at https://github.com/ningpang/ArkMSA.
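The pipeline described above (MLLM-generated descriptions and rationales, then a Text+Text input for a smaller LM) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_mllm` is a hypothetical stand-in for a real MLLM call, and the prompt templates are illustrative only.

```python
# Sketch of the ARK-style pipeline. All helper names and prompts here are
# hypothetical; the actual framework is in the authors' repository.

def query_mllm(image, prompt):
    # Placeholder for a real MLLM call (e.g., an image-grounded
    # reasoning model). Returns canned text for illustration.
    if "describe" in prompt.lower():
        return "A cheering crowd at a nighttime concert."
    return "The cheering crowd suggests a positive sentiment."

def build_auxiliary_knowledge(image, text):
    # Chain-of-Thought-style prompting: first elicit an image
    # description, then a rationale grounded in the text-image pair.
    description = query_mllm(image, "Describe the image.")
    rationale = query_mllm(
        image,
        f"Text: {text}\nImage description: {description}\n"
        "Explain step by step what sentiment this pair conveys.",
    )
    return description, rationale

def build_text_text_input(text, description):
    # Text+Text paradigm: fuse the source text with the generated
    # description so a smaller LM can fill the sentiment slot
    # via prompt-tuning.
    return f"Text: {text} Image: {description} Sentiment: [MASK]"

if __name__ == "__main__":
    source_text = "What a night!"
    description, rationale = build_auxiliary_knowledge(None, source_text)
    lm_input = build_text_text_input(source_text, description)
    print(lm_input)
    print(rationale)  # used as additional supervision during training
```

In this sketch the rationale is merely printed; in the framework it serves as an auxiliary supervision signal while the fused Text+Text string is the classification input.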