Enhancing Semantic Awareness by Sentimental Constraint With Automatic Outlier Masking for Multimodal Sarcasm Detection

Published: 2025 · Last Modified: 05 Jan 2026 · IEEE Trans. Multim. 2025 · License: CC BY-SA 4.0
Abstract: Multimodal sarcasm detection, which aims to uncover sarcastic sentiment behind multimodal data, has gained substantial attention in the multimodal community. Recent multimodal sarcasm detection (MSD) methods have primarily focused on modality alignment with pre-trained vision-language (V-L) models. However, text-image pairs in MSD tasks often exhibit weak or even opposite semantic correlations. Consequently, directly aligning these modalities can result in feature shift and inter-class confusion, ultimately hindering the model's ability to recognize sarcasm. To alleviate this issue, we propose the Enhancing Semantic Awareness Model (ESAM) for multimodal sarcasm detection. Specifically, we first devise a Modality-decoupled Framework (MDF) to separate the textual and visual features from the fused multimodal representation. This decoupling enables the parallel integration of the Sentimental Congruity Constraint (SCC) within both the visual and textual latent spaces, thereby enhancing the semantic awareness of each modality. Furthermore, since certain outlier samples with ambiguous sentiments can mislead training and weaken the effectiveness of SCC, we incorporate Automatic Outlier Masking, a mechanism that automatically detects and masks outliers, guiding the model to focus on more informative samples during training. Experimental results on two public MSD datasets validate the robustness and superiority of the proposed ESAM model.
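The abstract does not give the exact formulation of the constraint or the masking rule, but the general idea can be illustrated with a minimal sketch. Here, sentimental congruity between the decoupled textual and visual features is approximated by cosine similarity, and samples whose per-sample loss is largest are treated as ambiguous outliers and masked out of the objective. All function names, the contrastive-style loss form, and the quantile-based masking threshold are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def scc_with_outlier_masking(text_feat, image_feat, labels, mask_quantile=0.9):
    """Illustrative sketch (not the paper's formulation).

    text_feat, image_feat: (N, D) decoupled modality features.
    labels: (N,) with 1 = sarcastic, 0 = non-sarcastic.
    mask_quantile: per-sample losses above this quantile are masked.
    """
    def cosine(a, b):
        return np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)

    sim = cosine(text_feat, image_feat)
    # Congruity constraint: pull modalities together for non-sarcastic
    # pairs, push them apart for sarcastic (incongruent) pairs.
    per_sample = np.where(labels == 1, 1.0 + sim, 1.0 - sim)
    # Automatic outlier masking: the highest-loss samples are assumed to
    # carry ambiguous sentiment and are excluded from the average.
    threshold = np.quantile(per_sample, mask_quantile)
    mask = per_sample <= threshold
    return per_sample[mask].mean(), mask
```

In a full training loop, this masked loss would be added to the classification objective, so that gradient updates emphasize samples whose cross-modal sentiment signal is reliable.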