I$^2$C: Intra- and Inter-modality Consistency Learning for Multimodal Sentiment Analysis

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multimodal Learning, Multimodal Sentiment Analysis
Abstract: Multimodal sentiment analysis (MSA) aims to predict human sentiments by integrating signals from different modalities such as text, video, and audio. However, sentiment cues are often semantically inefficient: they exhibit inconsistency within and across modalities, which hinders robust understanding and inflates computation. In this paper, we propose I$^2$C, a framework that explicitly models Intra- and Inter-modality Consistency to guide effective and efficient sentiment prediction. I$^2$C first projects token-level features into a shared sentiment space and computes intra- and inter-modality consistency scores (I$^2$CS). The I$^2$CS serves three functions: (1) as a consistency loss for regularizing training; (2) as token-wise weights for reweighting features; and (3) as a compression signal for eliminating redundant or conflicting tokens. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets show that I$^2$C outperforms previous state-of-the-art models. Even after removing 90\% of tokens, I$^2$C maintains comparable performance, demonstrating robustness across varying token budgets. These results highlight consistency-aware learning as an effective strategy for improving both the accuracy and the efficiency of sentiment prediction.
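To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of I$^2$CS: each modality's tokens are projected into a shared space, intra-modality consistency is measured against the modality's own pooled anchor, inter-modality consistency against the other modalities' anchors, and the combined score drives top-k token compression and reweighting. The specific choices here (cosine similarity, mean-pooled anchors, the blend weight `alpha`, and softmax reweighting) are assumptions for illustration, not the authors' exact formulation.

```python
# Illustrative sketch of the I^2CS idea; all design choices below are
# assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def i2cs(feats, proj, alpha=0.5):
    """Per-token intra- and inter-modality consistency scores.

    feats: dict of (B, T_m, D_m) token features, e.g. keys "text", "video", "audio".
    proj:  dict of nn.Linear layers mapping each modality into a shared space.
    alpha: assumed blend weight between intra and inter terms.
    """
    # Project into the shared sentiment space and L2-normalize tokens.
    shared = {m: F.normalize(proj[m](x), dim=-1) for m, x in feats.items()}
    # Modality-level anchors: mean of each modality's tokens, renormalized.
    anchors = {m: F.normalize(h.mean(dim=1, keepdim=True), dim=-1)
               for m, h in shared.items()}
    scores = {}
    for m, h in shared.items():
        # Intra: cosine agreement of each token with its own modality's anchor.
        intra = (h * anchors[m]).sum(-1)                       # (B, T_m)
        # Inter: average agreement with the other modalities' anchors.
        others = [a for k, a in anchors.items() if k != m]
        inter = torch.stack([(h * a).sum(-1) for a in others]).mean(0)
        scores[m] = alpha * intra + (1 - alpha) * inter        # (B, T_m)
    return shared, scores

def compress(h, score, keep_ratio=0.1):
    """Keep the top-k most consistent tokens, reweighted by their scores."""
    k = max(1, int(h.size(1) * keep_ratio))
    idx = score.topk(k, dim=1).indices                         # (B, k)
    kept = h.gather(1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
    w = score.gather(1, idx).softmax(dim=1).unsqueeze(-1)      # token weights
    return kept * w

# The consistency loss could then regularize training, e.g.
# loss_cons = sum((1 - s).mean() for s in scores.values()),
# encouraging tokens to align within and across modalities.
```

A `keep_ratio` of 0.1 corresponds to the 90\% token removal reported in the abstract; the reweighting and compression steps share the same score, so the three uses of I$^2$CS come from a single forward computation.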
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6284