Disagreement-Aware Robust Training for Multi-modal Stance Detection under Model-internal Decision Inconsistency
Keywords: multimodal stance detection, stance detection, multimodal fusion, robust training, data augmentation
Abstract: Stance detection has mostly been studied in a single modality, with a focus on textual semantic understanding.
On social media, however, stance is expressed in more diverse ways. Multi-modal stance detection (MSD) leverages paired text and images to capture richer stance expressions, but it also introduces the challenge of fusing heterogeneous modalities.
Interestingly, our study uncovers an instructive model-internal state: the text and image encoders can yield inconsistent stance decisions, even when the input pair conveys a unified stance.
We term this measurable state \emph{Modal Decision Disagreement} (MDD).
Under MDD, standard training supervises only the final fused output; it does not constrain how the model should resolve these conflicting internal signals. As a result, simple averaging or alignment-oriented fusion often yields an incorrect or compromised prediction.
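To make the MDD notion concrete, here is a minimal sketch of how branch-level disagreement can be detected at the decision level; the tensor names (`text_logits`, `image_logits`) are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of detecting Modal Decision Disagreement (MDD) at the
# decision level. Assumes each unimodal branch exposes per-class stance
# logits; all names here are illustrative, not the paper's interface.
import torch

def mdd_mask(text_logits: torch.Tensor, image_logits: torch.Tensor) -> torch.Tensor:
    """Boolean mask over the batch: True where the text and image
    branches commit to different stance labels."""
    text_pred = text_logits.argmax(dim=-1)    # text-branch decision per sample
    image_pred = image_logits.argmax(dim=-1)  # image-branch decision per sample
    return text_pred != image_pred
```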
To address this, we propose \textbf{DART}, a disagreement-aware robust training framework.
Specifically, we introduce a decision-level auxiliary head that regularizes the fused predictor against branch disagreement.
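As a rough sketch of what such decision-level regularization could look like (the weighting scheme, function names, and the `aux_weight` hyperparameter are our own assumptions, not the exact DART objective):

```python
import torch
import torch.nn.functional as F

def disagreement_aware_loss(fused_logits, text_logits, image_logits,
                            labels, aux_weight=0.5):
    """Fused cross-entropy plus a decision-level auxiliary term that
    supervises each branch, up-weighted where the branches disagree."""
    fused_ce = F.cross_entropy(fused_logits, labels, reduction="none")
    branch_ce = (F.cross_entropy(text_logits, labels, reduction="none")
                 + F.cross_entropy(image_logits, labels, reduction="none"))
    # Up-weight MDD samples so fusion is explicitly trained to resolve them.
    disagree = (text_logits.argmax(-1) != image_logits.argmax(-1)).float()
    weight = 1.0 + disagree  # illustrative choice: MDD samples count double
    return (weight * (fused_ce + aux_weight * branch_ce)).mean()
```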
Moreover, to further improve robustness to such inconsistencies, we apply a text stance-flip perturbation that creates deliberately conflicting training instances. Together, these components make fusion more stable under branch-level disagreement.
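A toy illustration of the stance-flip idea (the cue lexicon and the string-level flip are purely illustrative; the paper's perturbation may be implemented differently):

```python
# Toy stance-flip perturbation: swap polarity cues in the text so the
# perturbed text deliberately conflicts with the paired image's stance.
# The lexicon is a stand-in; any stance-reversal method could be used.
FLIP_CUES = {"support": "oppose", "oppose": "support",
             "agree": "disagree", "disagree": "agree"}

def stance_flip(text: str) -> str:
    return " ".join(FLIP_CUES.get(tok.lower(), tok) for tok in text.split())

# Example: stance_flip("I support this policy") -> "I oppose this policy"
```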
Across all five MSD benchmarks, DART improves both in-target and zero-shot performance, with the largest gains on instances where the model exhibits MDD.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal pretraining, image text matching, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 5216