Abstract: Multimodal sentiment analysis (MSA) is a challenging task that aims to understand human emotions from text, visual, and audio modalities. Existing studies struggle to capture the monotonic relationship within emotional expressions, i.e., that emotional intensity changes consistently with expression amplitude within each emotional polarity, which is a crucial aspect of MSA. To tackle this, we propose a polarity-aware mixture of experts network (PAMoE-MSA), which learns polarity-specific and polarity-common features to capture this monotonic relationship from multimodal sentiment data. Our model consists of three experts: a positive expert, a negative expert, and a general expert. They are trained through a dedicated Guide Task, in which the positive and negative experts are trained on non-neutral samples, while the general expert is trained on all samples. A gating mechanism adaptively perceives the monotonic relationship within emotional expressions. Moreover, self-supervised labels are introduced to preserve modality-specific information. The expert module takes fused multimodal features as input, which carry richer emotional information. To stabilize training, we apply multi-side contrastive learning before making predictions. Evaluation of PAMoE-MSA on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets shows notable improvements over state-of-the-art methods, with gains of approximately 1.3% in Acc-7 on CMU-MOSI, 1.2% in Acc-2 on CMU-MOSEI, and 0.8% in F1-score on CH-SIMS.
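To make the three-expert, gated design concrete, the following is a minimal PyTorch sketch of a polarity-aware mixture of experts over fused multimodal features. All module names, dimensions, and the gating formulation are illustrative assumptions, not the authors' reference implementation, and the Guide Task and contrastive-learning components are omitted.

```python
# Minimal sketch of a polarity-aware mixture of experts (assumed structure).
import torch
import torch.nn as nn


class PolarityAwareMoE(nn.Module):
    def __init__(self, fused_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        # Positive, negative, and general experts operate on the fused
        # multimodal representation (dimensions are assumptions).
        self.pos_expert = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        self.neg_expert = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        self.gen_expert = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        # Gating network produces soft weights over the three experts.
        self.gate = nn.Sequential(nn.Linear(fused_dim, 3), nn.Softmax(dim=-1))
        self.regressor = nn.Linear(hidden_dim, 1)  # sentiment intensity score

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # Stack expert outputs: (batch, 3, hidden_dim).
        experts = torch.stack(
            [self.pos_expert(fused), self.neg_expert(fused), self.gen_expert(fused)],
            dim=1,
        )
        weights = self.gate(fused).unsqueeze(-1)      # (batch, 3, 1)
        combined = (weights * experts).sum(dim=1)     # weighted expert mixture
        return self.regressor(combined).squeeze(-1)   # (batch,)


# Usage with a batch of fused multimodal features (sizes are illustrative).
model = PolarityAwareMoE(fused_dim=128)
scores = model(torch.randn(8, 128))
```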