Abstract: Multimodal Emotion Recognition (MER) may encounter incomplete multimodal scenarios caused by sensor damage or privacy protection in practical applications. Existing incomplete multimodal learning methods focus on learning better joint representations across modalities. However, our investigation shows that they fall short in learning unimodal representations, which are also highly discriminative. To address this, we propose a novel framework named Mixture of Modality Knowledge Experts (MoMKE) with two-stage training. In unimodal expert training, each expert learns unimodal knowledge from its corresponding modality. In experts mixing training, both unimodal and joint representations are learned by leveraging the knowledge of all modality experts. In addition, we design a special Soft Router that enriches the modality representations by dynamically mixing the unimodal and joint representations. Extensive incomplete multimodal experiments on three benchmark datasets demonstrate the robust performance of MoMKE, especially under severely incomplete conditions. Visualization analysis further reveals the considerable value of unimodal and joint representations.
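As a rough illustration of the dynamic mixing idea described above, the following minimal sketch shows one way a soft router could combine a unimodal representation with a joint representation via learned gating weights. The class name, gate design, and tensor shapes are assumptions for illustration only, not the implementation used in the paper.

```python
import torch
import torch.nn as nn

class SoftRouter(nn.Module):
    """Hypothetical sketch: mix a unimodal representation with a joint
    representation using a learned convex combination (not the paper's code)."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate producing two mixing weights: one for the unimodal branch,
        # one for the joint branch.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, uni: torch.Tensor, joint: torch.Tensor) -> torch.Tensor:
        # uni, joint: (batch, dim) representations for one modality
        weights = torch.softmax(self.gate(torch.cat([uni, joint], dim=-1)), dim=-1)
        # Convex combination of the unimodal and joint representations
        return weights[..., 0:1] * uni + weights[..., 1:2] * joint
```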
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Engagement] Emotional and Social Signals, [Content] Media Interpretation
Relevance To Conference: Multimodal Emotion Recognition (MER) has shown significant application value in fields such as human-computer interaction, dialogue systems, and social media analysis. Current MER studies usually utilize multiple modalities, such as text, audio, and visual data. However, in practical applications, certain modalities may be unavailable due to sensor damage, speech recognition errors, or privacy protection issues. Learning under incomplete multimodal conditions is therefore a crucial problem. In this work, we propose a novel paradigm named Mixture of Modality Knowledge Experts (MoMKE), which enriches the modality representations by dynamically mixing the unimodal representations of the available modality(s) and the joint representations offered by the other modality experts. MoMKE overcomes the limitation of existing works that learn insufficient unimodal representations and thus achieves state-of-the-art performance on three benchmark datasets, with especially significant improvements under severely incomplete multimodal conditions. This study offers a new perspective and may shed light on enriching modality representations under incomplete multimodal conditions.
Submission Number: 5592