Abstract: Highlights
• A multimodal fusion network is proposed for effective visual–textual sentiment analysis.
• The proposed method eliminates the heterogeneity between visual and textual features.
• Attention mechanisms are used to minimize noise interference.
• Correlations between local region feature representations are leveraged.
• Extensive experiments demonstrate new state-of-the-art performance.
External IDs: dblp:journals/eswa/GanFFZCZ24