CIB-MoE: Counterfactual Inconsistency-Bottleneck Mixture-of-Experts for Robust Multimodal Aspect-based Sentiment Analysis and Sarcasm Detection

ACL ARR 2026 January Submission 10230 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Multimodal learning, Multimodal aspect-based sentiment analysis, Multimodal sarcasm detection, Cross-modal incongruity, Mixture-of-experts, Information bottleneck, Counterfactual calibration
Abstract: Multimodal posts on social media pose a fine-grained affective understanding challenge: the decisive signal often lies in instance-specific discrepancies between text and image, yet models are easily misled by weak cross-modal relevance, heterogeneous mismatch types (semantic, entity-level, and affective), and spurious lexical shortcuts. These issues are central to multimodal aspect-based sentiment analysis (MABSA), which demands aspect-conditioned predictions under noisy visual context, and multimodal sarcasm detection (MMSD), where sarcasm is frequently expressed through cross-modal incongruity rather than surface polarity. We propose **CIB-MoE** (**C**ounterfactual **I**nconsistency-**B**ottleneck **M**ixture-**o**f-**E**xperts), a unified framework that performs discrepancy-aware conditional computation instead of monolithic fusion. CIB-MoE builds lightweight difference experts that quantify complementary mismatch cues—e.g., CLIP-based semantic inconsistency and entity-level misalignment derived from Top-$N$ predicted object labels—and routes them through a two-level gate with an information-bottleneck regularizer for sparse and stable expert usage. To further suppress shortcut-driven routing, we calibrate the gate with realizable counterfactual interventions by substituting the image with neutral (text-aligned) and random (noise) alternatives and imposing ranking/consistency constraints on routing and predictions. Experiments on Twitter-2015/2017 and MMSD/MMSD2.0 show that CIB-MoE achieves state-of-the-art performance while improving robustness under distribution shift and counterfactual evaluation.
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: emotion detection and analysis, language resources
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 10230
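
The abstract describes two ingredients that a small sketch may help make concrete: a CLIP-based semantic inconsistency cue and a ranking constraint over counterfactual image substitutions. The code below is an illustrative sketch only, not the authors' implementation. It assumes that the semantic inconsistency score is one minus the cosine similarity of precomputed text and image embeddings, and that the ranking constraint is a margin hinge requiring the neutral (text-aligned) image to score as less inconsistent than the random (noise) image. The function names, the margin value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_inconsistency(text_emb, image_emb):
    """Assumed cue: 1 - cosine similarity of (e.g., CLIP) embeddings.

    0 means the two embeddings point the same way; larger values
    mean a larger text-image mismatch.
    """
    return 1.0 - cosine(text_emb, image_emb)


def ranking_hinge(higher, lower, margin=0.1):
    """Hinge penalty that is zero once `higher` exceeds `lower` by `margin`."""
    return max(0.0, margin - (higher - lower))


def counterfactual_ranking_loss(text_emb, neutral_img_emb, random_img_emb,
                                margin=0.1):
    """Assumed ordering: the random image should look more inconsistent
    with the text than the neutral (text-aligned) image does."""
    d_neutral = semantic_inconsistency(text_emb, neutral_img_emb)
    d_random = semantic_inconsistency(text_emb, random_img_emb)
    return ranking_hinge(d_random, d_neutral, margin)


if __name__ == "__main__":
    # Toy 3-d embeddings standing in for real encoder outputs.
    text = [1.0, 0.0, 0.0]
    neutral_image = [0.9, 0.1, 0.0]   # roughly aligned with the text
    random_image = [0.0, 0.0, 1.0]    # unrelated to the text
    print(counterfactual_ranking_loss(text, neutral_image, random_image))
```

In a full model, a penalty of this kind would presumably be added to the task loss alongside the information-bottleneck regularizer on the gate; how the paper weights and combines these terms is not stated in the abstract.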