Abstract: Multi-modal Sarcasm Detection (MSD) combines multiple modalities to identify implicit sarcastic sentiment, but the role of commonsense knowledge in emotion recognition is often overlooked. Moreover, the visual emotions tied to sarcastic cues in the text are usually dispersed across the image, which complicates detection. We propose Dual Synergetic Perception Graph Convolutional Networks (DSP-GCN) to address these issues. First, we construct a cross-modal knowledge incongruity graph that links key visual sentiments with relevant text tokens. Then, we enhance feature focus using a transformer encoder equipped with the Convolutional Block Attention Module (CBAM). Finally, a Global Modality Synergistic Fusion (GMSF) block models global relationships within each modality for improved sarcasm detection. Our framework outperforms state-of-the-art methods on benchmark datasets.
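The sketch below is a minimal, illustrative rendering of the components named in the abstract, assuming a PyTorch setting; the graph construction, feature dimensions, and the final fusion step are simplified placeholders rather than the authors' DSP-GCN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step over a cross-modal incongruity graph (simplified)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) node features (text tokens + visual regions)
        # adj: (N, N) adjacency of the assumed knowledge-incongruity graph
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return F.relu(self.proj((adj @ x) / deg))


class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel + spatial attention over visual feature maps."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat):                       # feat: (B, C, H, W)
        avg = feat.mean(dim=(2, 3))                # channel attention
        mx, _ = feat.flatten(2).max(dim=2)
        ch = torch.sigmoid(self.mlp(avg) + self.mlp(mx))[:, :, None, None]
        feat = feat * ch
        sp = torch.sigmoid(self.spatial(torch.cat(  # spatial attention
            [feat.mean(1, keepdim=True), feat.max(1, keepdim=True).values], 1)))
        return feat * sp


# Toy usage: 6 text-token nodes + 4 visual-region nodes, 64-dim features.
x = torch.randn(10, 64)
adj = (torch.rand(10, 10) > 0.7).float()
nodes = SimpleGCNLayer(64)(x, adj)

vis = torch.randn(2, 64, 7, 7)
vis_att = ChannelSpatialAttention(64)(vis)

# Placeholder global fusion: pool both streams and concatenate for a classifier head.
fused = torch.cat([nodes.mean(0, keepdim=True).expand(2, -1),
                   vis_att.mean(dim=(2, 3))], dim=-1)
print(fused.shape)  # torch.Size([2, 128])
```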