Abstract: With the rapid growth of multimodal comments on social media, multimodal sentiment detection has become increasingly important. However, most existing methods overlook the difference in information density between text and images, and fall short of fully utilizing the multiscale information in images. To address these issues, we propose a multiscale adaptive fusion model, termed MSAF, for multimodal sentiment detection. MSAF first extracts fine- and coarse-scale image features through a multiscale visual encoder and applies a multiscale adaptive pooling module to adjust the weights of different regional features. MSAF then incorporates multiscale contrastive learning and multiscale rivalry tasks to ensure that the model retains associations between features at different scales while preserving their diversity. These features are sequentially fused with text through a text-guided hierarchical fusion encoder, enabling MSAF to focus on sentiment-salient regions in the image. Finally, the multimodal fusion embeddings are fed into a classifier to predict the sentiment. Extensive experiments on multiple public datasets demonstrate the effectiveness and superiority of MSAF.
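To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the main components named in the abstract: adaptive weighting of regional features, text-guided hierarchical fusion over fine- and coarse-scale visual features, and a contrastive term aligning the two scales. All module names, dimensions, and design details (attention-based fusion, InfoNCE-style loss) are our assumptions for illustration; the paper's actual multiscale visual encoder and rivalry tasks are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleAdaptivePooling(nn.Module):
    """Learns per-region weights at one scale (assumed scoring mechanism)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions):                              # (B, N, D)
        weights = torch.softmax(self.score(regions), dim=1)  # (B, N, 1)
        return (weights * regions).sum(dim=1)                # (B, D)

class TextGuidedFusion(nn.Module):
    """One fusion stage: text tokens query the visual features of one scale."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, visual):          # text: (B, T, D), visual: (B, N, D)
        fused, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + fused)        # residual keeps textual guidance

class MSAFSketch(nn.Module):
    """Hypothetical forward pass: fine-scale fusion first, then coarse-scale."""
    def __init__(self, dim=768, num_classes=3):
        super().__init__()
        self.pool_fine = MultiscaleAdaptivePooling(dim)
        self.pool_coarse = MultiscaleAdaptivePooling(dim)
        self.fuse_fine = TextGuidedFusion(dim)
        self.fuse_coarse = TextGuidedFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_emb, fine_regions, coarse_regions):
        h = self.fuse_fine(text_emb, fine_regions)      # fuse fine scale
        h = self.fuse_coarse(h, coarse_regions)         # then coarse scale
        return self.classifier(h.mean(dim=1))           # pool tokens, classify

def multiscale_contrastive_loss(fine, coarse, tau=0.07):
    """InfoNCE-style alignment of pooled fine/coarse features (assumed form)."""
    fine = F.normalize(fine, dim=-1)
    coarse = F.normalize(coarse, dim=-1)
    logits = fine @ coarse.t() / tau                     # (B, B) similarity
    targets = torch.arange(fine.size(0), device=fine.device)
    return F.cross_entropy(logits, targets)             # match i-th pair
```

A quick shape check under these assumptions: with BERT-sized text embeddings `torch.randn(2, 16, 768)`, fine-scale patches `torch.randn(2, 49, 768)`, and coarse-scale regions `torch.randn(2, 9, 768)`, `MSAFSketch()` returns `(2, 3)` sentiment logits, and the contrastive loss operates on the pooled `(2, 768)` outputs of the two pooling modules.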