Abstract: Multimodal sarcasm is often used to express strong emotions online through a discrepancy between the literal and figurative meanings conveyed across modalities. Existing work retrofits transformer-based pretrained language models to integrate text and image for sarcasm detection. However, these methods struggle to distinguish subtle semantic and emotional differences between the image and text within the same instance. To address this issue, this paper proposes a context-aware dual attention network that collaboratively performs textual and visual attention through a shared memory module, enabling reasoning over the interconnected regions of text and image that signal sarcasm. Additionally, we use implicit context derived from a multimodal commonsense graph to establish a holistic view of semantics and emotions across modalities. Finally, a multi-view cross-modal matching technique is employed to identify cross-modal contradictions. We evaluate our method on the widely used HFM dataset and achieve a 1.01% improvement in F1-score. Extensive experiments demonstrate the effectiveness of the proposed method.
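To make the shared-memory idea concrete, the following is a minimal illustrative sketch (not the authors' released code): both the textual and visual attention streams query the same learnable memory slots, so the two attentions are coupled. All module names, dimensions, and the use of standard multi-head attention are assumptions for exposition only.

```python
# Illustrative sketch only: dual textual/visual attention over a shared memory.
# Dimensions, slot count, and class names are hypothetical, not from the paper.
import torch
import torch.nn as nn

class SharedMemoryDualAttention(nn.Module):
    def __init__(self, dim: int = 768, memory_slots: int = 32, num_heads: int = 8):
        super().__init__()
        # Learnable memory slots read by both modalities
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats:  (batch, text_len, dim)     e.g. token embeddings
        # image_feats: (batch, num_regions, dim)  e.g. region/patch embeddings
        batch = text_feats.size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)
        # Both streams attend to the same memory, coupling the two attentions.
        text_out, _ = self.text_attn(query=text_feats, key=mem, value=mem)
        image_out, _ = self.image_attn(query=image_feats, key=mem, value=mem)
        return text_out, image_out

# Usage with random tensors standing in for encoder outputs
if __name__ == "__main__":
    model = SharedMemoryDualAttention()
    t, v = model(torch.randn(2, 40, 768), torch.randn(2, 49, 768))
    print(t.shape, v.shape)  # torch.Size([2, 40, 768]) torch.Size([2, 49, 768])
```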