Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcastic expressions by integrating multimodal and contextual cues to capture cross-modal semantic inconsistencies. However, existing studies face three challenges: fine-grained sarcastic cues are implicit and dispersed, large generative models suffer from vanishing gradients in classification tasks, and cross-domain generalization remains limited. To address these limitations, we propose InterARM, an interpretable affective reasoning model that introduces a structured sarcasm reasoning paradigm. Specifically, we introduce a three-stage training strategy based on curriculum learning, consisting of 1) sarcasm classification learning, 2) structured sarcasm reasoning learning, and 3) step-selective hierarchical reward reinforcement learning. The model generates reasoning chains across modalities to analyze sentiment polarity and semantic conflicts. We also construct the MSD-CoT dataset, comprising 3,200 image-text pairs and 19,200 human-annotated reasoning steps. Experiments show that InterARM achieves superior performance on both in-domain and out-of-domain datasets, outperforming larger MLLMs such as Qwen2.5-VL-7B and GPT-5-mini while maintaining high interpretability and strong cross-domain generalization.
External IDs: doi:10.1109/taffc.2026.3653505