Abstract: Sarcasm in social media, frequently conveyed through the interplay of text and images, poses significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection approaches have been shown to rely excessively on superficial cues in the textual modality and to struggle to discern sarcasm arising from subtle text-image interactions. To address this limitation, a novel framework, InterCLIP-MEP, is proposed. The framework builds on Interactive CLIP (InterCLIP), which derives enriched cross-modal representations by embedding inter-modal information directly into each encoder, and adopts an efficient training strategy that requires approximately 20.6\(\times\) fewer trainable parameters than existing state-of-the-art (SOTA) methods. In addition, a Memory-Enhanced Predictor (MEP) is introduced: a dynamic dual-channel memory mechanism captures and retains valuable knowledge from test samples during inference, and this memory serves as a non-parametric classifier that makes sarcasm detection more robust. Extensive experiments on MMSD, MMSD2.0, and DocMSU show that InterCLIP-MEP achieves SOTA performance, improving accuracy by 1.08% and F1 score by 1.51% on MMSD2.0. Under distribution-shift evaluation, it attains 73.96% accuracy, exceeding its memory-free variant by nearly 10% and the previous SOTA by over 15%, demonstrating superior stability and adaptability. The implementation of InterCLIP-MEP is publicly available at https://github.com/CoderChen01/InterCLIP-MEP.
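As a rough illustration of the dual-channel memory idea described in the abstract, the sketch below keeps one fixed-size feature bank per class (non-sarcastic / sarcastic) and scores a test sample by its mean cosine similarity to each bank, which is then fused with a parametric classifier's output. The FIFO eviction policy, the fusion weight `alpha`, and all names are illustrative assumptions, not the authors' exact design; see the linked repository for the actual implementation.

```python
import torch
import torch.nn.functional as F


class DualChannelMemory:
    """Minimal sketch of a dual-channel memory used as a non-parametric
    classifier at inference time (assumed design, not the paper's exact one)."""

    def __init__(self, capacity: int = 256):
        # Channel 0 holds non-sarcastic features, channel 1 sarcastic ones.
        self.banks: list[list[torch.Tensor]] = [[], []]
        self.capacity = capacity

    def update(self, feat: torch.Tensor, probs: torch.Tensor) -> None:
        # Store the L2-normalized feature under its predicted class,
        # keeping only the most recent `capacity` entries (assumed FIFO).
        label = int(probs.argmax())
        self.banks[label].append(F.normalize(feat, dim=-1))
        self.banks[label] = self.banks[label][-self.capacity:]

    def predict(self, feat: torch.Tensor) -> torch.Tensor:
        # Non-parametric prediction: mean cosine similarity of the sample
        # to each channel, turned into class probabilities via softmax.
        f = F.normalize(feat, dim=-1)
        sims = [
            (torch.stack(bank) @ f).mean() if bank else torch.tensor(0.0)
            for bank in self.banks
        ]
        return torch.stack(sims).softmax(dim=-1)


# Toy usage with random features standing in for fused text-image embeddings.
memory = DualChannelMemory()
alpha = 0.5  # fusion weight between classifier and memory (assumption)
for _ in range(10):
    feat = torch.randn(512)                    # cross-modal sample feature
    clf_probs = torch.rand(2).softmax(dim=-1)  # parametric classifier output
    fused = alpha * clf_probs + (1 - alpha) * memory.predict(feat)
    memory.update(feat, clf_probs)
    prediction = int(fused.argmax())
```

Under these assumptions, the memory accumulates confidently labeled test features as inference proceeds, so later predictions benefit from knowledge retained from earlier samples without any additional parameter updates.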
DOI: 10.1145/3776561