Keywords: Multimodal Learning; Causal Inference; Adversarial Learning; Representation Learning
Abstract: Multimodal Intent Recognition (MIR) plays a key role in advancing human-computer interaction, yet its reliability is often challenged by spurious correlations and missing modalities in real-world data. Existing approaches, which mainly rely on complex fusion architectures or contrastive alignment, generally do not account for the underlying causal structures of multimodal signals, resulting in limited generalization and robustness. They typically treat missing modalities as a data issue to be addressed by passive imputation rather than as an opportunity to learn deeper, causally informed representations. To address these limitations, we propose the Counterfactual Adversarial Representation Enhancement (CARE) framework, which reframes MIR as a causal learning problem. CARE implements causal principles through two complementary modules: a counterfactual generation module that interprets modality completion as a causal intervention to capture shared, abstract concepts across modalities, and an adversarial de-confounding mechanism that employs a Gradient Reversal Layer and a modality discriminator to remove the confounding effects of the modality combination, enforcing the learning of intervention-invariant representations. This dual approach ensures that the learned intent features are both robust to missing data and causally consistent. We evaluate CARE extensively on the MIntRec and the more challenging MIntRec2.0 datasets. Results show that CARE achieves state-of-the-art performance, surpassing the strongest baseline by up to 4.41% in WF1 and 12.03% in recall, while maintaining high robustness under various missing-modality scenarios. This work introduces a principled paradigm for building causally robust multimodal systems, providing a systematic way to mitigate confounding bias and improve generalization in complex, real-world interactive environments.
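The adversarial de-confounding mechanism in the abstract hinges on a Gradient Reversal Layer (GRL): an identity map in the forward pass whose backward pass flips (and scales) the gradient, so that minimizing the modality discriminator's loss pushes the encoder to *remove* modality-combination information. A minimal NumPy sketch of this forward/backward behavior follows; the class name, the `lam` coefficient, and the toy values are our own illustration, not the authors' implementation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=0.5):
        # lam trades off how strongly the reversed gradient penalizes
        # modality-identifiable features (a hypothetical hyperparameter here).
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged to the modality discriminator.
        return x

    def backward(self, grad_output):
        # The reversed gradient trains the upstream encoder to *confuse*
        # the discriminator, encouraging intervention-invariant features.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)                  # identity: same values as x
g = grl.backward(np.ones_like(x))   # reversed: -0.5 everywhere
```

In an autograd framework this would be registered as a custom function (e.g. a `torch.autograd.Function`) placed between the fused representation and the modality discriminator, while the intent classifier receives the un-reversed features.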
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5908