Abstract: Multimodal sarcasm detection aims to identify sarcasm in given text-image pairs, where subtle contradictions between modalities are key to recognizing irony. This task is essential for understanding nuanced human communication, especially in social media contexts. However, existing methods often overfit to superficial textual patterns or fail to adequately model cross-modal incongruities, resulting in suboptimal performance. To address this, we propose the Generative Sarcasm Discrepancy Network (GSDNet), which more effectively exploits cross-modal conflicts. GSDNet features a specialized Generative Discrepancy Representation Module (GDRM), which synthesizes image-aligned text using a large language model and quantifies both semantic and sentiment discrepancies by comparing the generated text with the original input. These discrepancies are then integrated with the text and image representations via a gated fusion mechanism, enabling adaptive balancing of modality contributions and mitigating modality dominance and spurious correlations. Extensive experiments on two benchmarks demonstrate that GSDNet outperforms state-of-the-art models, achieving superior accuracy and robustness. These results highlight the effectiveness of discrepancy-based features and gated multimodal fusion for sarcasm detection.
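The abstract gives no equations for the gated fusion step, so the following is only a minimal sketch of one plausible reading: a learned gate softly weights the text, image, and discrepancy representations before classification. All names, shapes, and the softmax-gate formulation are assumptions for illustration, not the paper's actual formulation; in GSDNet the gate parameters would be learned end to end rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8  # feature dimension (illustrative only)
h_text = rng.standard_normal(d)    # text encoder output (assumed)
h_image = rng.standard_normal(d)   # image encoder output (assumed)
h_disc = rng.standard_normal(d)    # discrepancy features from comparing
                                   # generated vs. original text (assumed)

# Hypothetical gate parameters; in the model these would be learned.
W = rng.standard_normal((3, 3 * d))

concat = np.concatenate([h_text, h_image, h_disc])
logits = W @ concat
gates = np.exp(logits - logits.max())
gates = gates / gates.sum()        # softmax over the three modality gates

# Fused representation: adaptive convex combination of the three modalities,
# which lets the model down-weight a dominant or spurious modality.
fused = gates[0] * h_text + gates[1] * h_image + gates[2] * h_disc
```

Because the gate weights sum to one, no single modality can dominate unless the gate learns to favor it, which matches the abstract's claim of mitigating modality dominance.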
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: stance detection, argument mining, style analysis
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7447