Abstract: Target-oriented multimodal sentiment classification (TMSC) aims to determine the sentiment polarity associated with each target within a sentence-image pair. Previous research has either not distinguished the roles of the textual and visual modalities or has subjectively assigned primary status to the textual modality. However, the contribution of each modality to predicting the sentiment polarity of a target word varies across contexts. Given the pivotal role of the target word in TMSC, we introduce a framework with adaptive modality weighting to detect target-related information. Specifically, the framework adaptively determines the importance of each modality from contribution weights generated for sentiment prediction toward the target word. The modality with the larger weight is treated as the primary modality and is leveraged to enhance the multimodal representation during the fusion stage. To further acquire target-related information, a large vision-language model is used to generate external target-specific knowledge descriptions as a supplementary textual modality, helping to identify the sentiment of each target accurately. Experimental results on the multimodal Twitter-2015 and Twitter-2017 datasets show that our proposed method outperforms other competitive baselines.
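A minimal sketch of the adaptive modality-weighting idea described in the abstract, assuming target-conditioned text and image features of equal dimension; the module and variable names (AdaptiveModalityWeighting, text_feat, image_feat, target_feat) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveModalityWeighting(nn.Module):
    """Illustrative sketch: score each modality's contribution to sentiment
    prediction for a given target, then fuse the weighted features so the
    higher-weighted (primary) modality dominates the multimodal representation."""

    def __init__(self, dim: int):
        super().__init__()
        # Scores a modality's relevance to the target representation.
        self.score = nn.Linear(2 * dim, 1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, image_feat, target_feat):
        # text_feat, image_feat, target_feat: (batch, dim)
        w_text = self.score(torch.cat([text_feat, target_feat], dim=-1))
        w_image = self.score(torch.cat([image_feat, target_feat], dim=-1))
        # Contribution weights over the two modalities.
        weights = F.softmax(torch.cat([w_text, w_image], dim=-1), dim=-1)
        weighted_text = weights[:, 0:1] * text_feat
        weighted_image = weights[:, 1:2] * image_feat
        # Fuse; the primary (larger-weight) modality contributes more.
        return self.fuse(torch.cat([weighted_text, weighted_image], dim=-1))


# Hypothetical usage with random features.
if __name__ == "__main__":
    batch, dim = 4, 768
    model = AdaptiveModalityWeighting(dim)
    fused = model(torch.randn(batch, dim),
                  torch.randn(batch, dim),
                  torch.randn(batch, dim))
    print(fused.shape)  # torch.Size([4, 768])
```

The LVLM-generated knowledge descriptions mentioned in the abstract would simply be encoded and supplied as an additional textual feature alongside text_feat in such a scheme; that extension is not shown here.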
External IDs: dblp:conf/nlpcc/LiuM24