Disagreement-Aware Robust Training for Multi-modal Stance Detection under Model-internal Decision Inconsistency
Keywords: multimodal stance detection, stance detection, multimodal fusion, robust training, data augmentation
Abstract: Stance detection has mostly been studied in a single modality, with a focus on textual semantic understanding.
On social media, however, stance is expressed in more diverse ways. Multi-modal stance detection (MSD) leverages paired text and images to capture richer stance expressions, but it also introduces the challenge of fusing heterogeneous modalities.
Interestingly, our study uncovers an instructive model-internal state: the text and image encoders can yield inconsistent stance decisions, even when the input pair conveys a unified stance.
We term this measurable state \emph{Modal Decision Disagreement} (MDD).
Under MDD, standard training supervises only the final fused output; it does not constrain how the model should resolve these conflicting internal signals. As a result, simple averaging or alignment-oriented fusion often yields an incorrect or compromised prediction.
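To make the MDD notion concrete, here is a minimal sketch of how branch-level disagreement can be detected at the decision level; the tensor names (`text_logits`, `image_logits`) are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of detecting Modal Decision Disagreement (MDD) at the
# decision level. Assumes each unimodal branch exposes per-class stance
# logits; all names here are illustrative, not the paper's interface.
import torch

def mdd_mask(text_logits: torch.Tensor, image_logits: torch.Tensor) -> torch.Tensor:
    """Boolean mask over the batch: True where the text and image
    branches commit to different stance labels."""
    text_pred = text_logits.argmax(dim=-1)    # text-branch decision per sample
    image_pred = image_logits.argmax(dim=-1)  # image-branch decision per sample
    return text_pred != image_pred
```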
To address this, we propose \textbf{DART}, a disagreement-aware robust training framework.
Specifically, we introduce a decision-level auxiliary head that regularizes the fused predictor against branch disagreement.
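As a rough sketch of what such decision-level regularization could look like (the weighting scheme, function names, and the `aux_weight` hyperparameter are our own assumptions, not the exact DART objective):

```python
import torch
import torch.nn.functional as F

def disagreement_aware_loss(fused_logits, text_logits, image_logits,
                            labels, aux_weight=0.5):
    """Fused cross-entropy plus a decision-level auxiliary term that
    supervises each branch, up-weighted where the branches disagree."""
    fused_ce = F.cross_entropy(fused_logits, labels, reduction="none")
    branch_ce = (F.cross_entropy(text_logits, labels, reduction="none")
                 + F.cross_entropy(image_logits, labels, reduction="none"))
    # Up-weight MDD samples so fusion is explicitly trained to resolve them.
    disagree = (text_logits.argmax(-1) != image_logits.argmax(-1)).float()
    weight = 1.0 + disagree  # illustrative choice: MDD samples count double
    return (weight * (fused_ce + aux_weight * branch_ce)).mean()
```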
Moreover, to further improve robustness to such inconsistencies, we apply a text stance-flip perturbation that creates deliberately conflicting training instances. Together, these components make fusion more stable under branch-level disagreement.
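A toy illustration of the stance-flip idea (the cue lexicon and the string-level flip are purely illustrative; the paper's perturbation may be implemented differently):

```python
# Toy stance-flip perturbation: swap polarity cues in the text so the
# perturbed text deliberately conflicts with the paired image's stance.
# The lexicon is a stand-in; any stance-reversal method could be used.
FLIP_CUES = {"support": "oppose", "oppose": "support",
             "agree": "disagree", "disagree": "agree"}

def stance_flip(text: str) -> str:
    return " ".join(FLIP_CUES.get(tok.lower(), tok) for tok in text.split())

# Example: stance_flip("I support this policy") -> "I oppose this policy"
```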
Across all five MSD benchmarks, DART improves both in-target and zero-shot performance, with the largest gains on instances where the model exhibits MDD.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal pretraining, image text matching, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 5216