SEER: Label-Structured Modality Routing for Multimodal Sentiment Analysis and Intent Recognition

20 Apr 2026 (modified: 09 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Multimodal sentiment analysis and intent recognition require models to combine textual, acoustic, and visual evidence whose reliability varies across utterances. Although adaptive fusion can address this variability by assigning sample-specific modality weights, many existing routing mechanisms estimate confidence from raw feature statistics, generic similarity measures, or prototype assignments that are only indirectly related to the downstream label structure. This can make routing sensitive to modality style or feature magnitude rather than to the evidence most relevant for sentiment or intent prediction. To study this issue, we introduce a staged routing framework. First, Emotion-Aware Modality Calibration (EAMC) serves as an encoded-space routing baseline that estimates modality reliability after semantic encoding while keeping the backbone and weighted-sum fusion rule fixed. Building on this baseline, we propose Structured Evidence Estimation and Routing (SEER), which incorporates label structure into the representation space used for confidence estimation. SEER-L0 adds label-aware contrastive supervision to organize modality representations according to task labels, while SEER-L1 estimates modality confidence by matching modality-adapted representations to shared label-structured anchors. We also evaluate SEER-L2, a prototype-guided temporal evidence extraction extension. Experiments on aligned CMU-MOSI, aligned CMU-MOSEI, and MIntRec under a multi-run evaluation protocol show that SEER-L1 provides the most consistent improvement over EAMC on the primary metrics, namely binary F1 for sentiment analysis and Weighted F1 for intent recognition. In contrast, SEER-L2 does not improve performance in the current setting. These results suggest that, for the evaluated benchmarks, adaptive multimodal routing benefits more from label-structured confidence estimation than from adding temporal pooling complexity.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Han-Jia_Ye1
Submission Number: 8532
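
Illustrative sketch (not taken from the paper): the abstract describes SEER-L1 as estimating modality confidence by matching modality-adapted representations to shared label-structured anchors, with a weighted-sum fusion rule. The abstract does not give the exact formulation, so the snippet below is only one plausible reading under stated assumptions: confidence is taken as the best cosine similarity between a modality representation and any anchor, weights come from a temperature-scaled softmax across modalities, and the function name `anchor_routing`, the `temperature` parameter, and all shapes are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def anchor_routing(reps, anchors, temperature=0.1):
    """Hypothetical anchor-matching router (an assumption, not the authors' code).

    reps:    dict mapping modality name -> array of shape (batch, dim)
    anchors: array of shape (num_labels, dim), shared across modalities
    Returns (weights, fused) with shapes (batch, num_modalities) and (batch, dim).
    Modalities are ordered alphabetically by name.
    """
    names = sorted(reps)
    unit_anchors = l2_normalize(anchors)
    # Assumed confidence score: best cosine match to any label anchor.
    scores = np.stack(
        [(l2_normalize(reps[m]) @ unit_anchors.T).max(axis=1) for m in names],
        axis=1,
    )
    # Temperature-scaled softmax across modalities (numerically stabilized).
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted-sum fusion of the modality representations.
    fused = sum(weights[:, i:i + 1] * reps[m] for i, m in enumerate(names))
    return weights, fused


# Toy inputs: three label anchors in an 8-dimensional space.
anchors = np.eye(3, 8)
reps = {
    "text": anchors[[0, 2]].copy(),   # lies exactly on an anchor
    "audio": np.ones((2, 8)),         # weakly aligned with every anchor
    "vision": -np.ones((2, 8)),       # points away from every anchor
}
weights, fused = anchor_routing(reps, anchors)
print(dict(zip(sorted(reps), weights[0].round(4))))
```

In this toy input the text representation coincides with an anchor, so it receives the largest weight, while the vision representation, which points away from all anchors, receives the smallest. A lower `temperature` makes the routing closer to selecting a single modality; a higher one moves it toward uniform averaging.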