Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to identify the emotions conveyed by each utterance in a conversational video. Current efforts focus on modeling speaker-sensitive context dependencies and multimodal fusion. Despite this progress, the reliability of MERC methods remains largely unexplored. Extensive empirical studies reveal that current methods suffer from unreliable predictive confidence: in some cases, the confidence estimated by these models increases when a modality or specific contextual cues are corrupted; we refer to such cases as uncertain samples. This contradicts a foundational principle of informatics, namely the elimination of uncertainty. Motivated by this, we propose CMERC, a novel calibration framework that calibrates MERC models without altering the model structure. It integrates curriculum learning, which guides the model to progressively learn from increasingly uncertain samples; hybrid supervised contrastive learning, which refines utterance representations by pulling uncertain samples apart from the others; and a confidence constraint, which penalizes the model on uncertain samples. Experimental results on two datasets show that CMERC significantly enhances the reliability and generalization of various MERC models, surpassing state-of-the-art methods.
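The abstract does not give the exact formulation of the three components, but a minimal PyTorch-style sketch of how such a calibration objective might be assembled is shown below. The uncertainty flag (confidence rising under corruption), the linear curriculum ramp, the positive-pair definition in the contrastive term, and the weights `lambda_con` / `lambda_conf` are all illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F


def flag_uncertain(conf_full, conf_corrupted):
    """Flag a sample as 'uncertain' when its predicted confidence rises after a
    modality (or contextual cue) is corrupted -- the failure mode described in
    the abstract. Both inputs are per-sample max-softmax confidences."""
    return conf_corrupted > conf_full


def supcon_loss(features, labels, uncertain_mask, temperature=0.1):
    """Hypothetical hybrid supervised contrastive term: same-label pairs count
    as positives only when both samples share the same uncertainty status, so
    uncertain samples are pulled apart from the others."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.T / temperature
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()

    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    same_status = uncertain_mask.unsqueeze(0) == uncertain_mask.unsqueeze(1)
    pos_mask = (same_label & same_status).float()
    pos_mask.fill_diagonal_(0)

    not_self = 1.0 - torch.eye(feats.size(0), device=feats.device)
    exp_logits = torch.exp(logits) * not_self
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)
    mean_log_prob = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob.mean()


def cmerc_style_loss(logits, features, labels, uncertain_mask, epoch, total_epochs,
                     lambda_con=0.5, lambda_conf=0.5):
    """Illustrative composite objective: curriculum-weighted cross-entropy
    + contrastive refinement + a confidence penalty on uncertain samples."""
    # Curriculum: uncertain samples are down-weighted early in training and
    # ramped in linearly over the first half of the epochs (an assumption).
    ramp = min(1.0, (epoch + 1) / max(1, total_epochs // 2))
    ce = F.cross_entropy(logits, labels, reduction='none')
    weights = torch.where(uncertain_mask,
                          torch.full_like(ce, ramp),
                          torch.ones_like(ce))
    ce = (weights * ce).mean()

    con = supcon_loss(features, labels, uncertain_mask)

    # Confidence constraint: penalize high max-softmax confidence on the
    # uncertain samples only.
    conf = F.softmax(logits, dim=1).max(dim=1).values
    conf_pen = (conf * uncertain_mask.float()).sum() / uncertain_mask.float().sum().clamp(min=1)

    return ce + lambda_con * con + lambda_conf * conf_pen
```

In use, one would run the underlying MERC model twice per utterance (once on the full input, once with a modality or context window corrupted), flag uncertain samples with `flag_uncertain`, and feed the flags into `cmerc_style_loss` alongside the usual logits and utterance features; since only the loss changes, the base model architecture is left untouched, consistent with the abstract's claim.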
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work advances multimodal processing by addressing a critical challenge in Multimodal Emotion Recognition in Conversations (MERC) – the reliability of predictive confidence. The proposed Calibration framework for MERC (CMERC) introduces a novel approach that does not require altering the model structure. CMERC integrates curriculum learning to progressively guide the model to learn from uncertain samples, hybrid supervised contrastive learning to refine utterance representations, and a confidence constraint to penalize overconfident predictions on uncertain samples. This contribution is crucial because it not only improves the generalization of MERC models but also enhances their reliability by ensuring that high-confidence predictions align with true emotions. By addressing the issue of unreliable predictive confidence, CMERC lays a foundation for more trustworthy and robust multimodal emotion recognition systems, thereby advancing the field of multimedia/multimodal processing.
Supplementary Material: zip
Submission Number: 4451