Multi-Modal Emotion Recognition via Hierarchical Knowledge Distillation

Published: 01 Jan 2024, Last Modified: 11 Apr 2025 · IEEE Trans. Multim. 2024 · CC BY-SA 4.0
Abstract: Due to its wide applications, multimodal emotion recognition has gained increasing research attention. Although existing methods have achieved compelling success with various multimodal fusion strategies, they overlook that the dominant modality (e.g., text) may create a shortcut and hence hinder the representation learning of the other modalities (e.g., image and audio). To alleviate this problem, we resort to knowledge distillation to narrow the gap between different modalities. In particular, we develop a new hierarchical knowledge distillation model for multi-modal emotion recognition (HKD-MER), consisting of three components: feature extraction, hierarchical knowledge distillation, and attentive multi-modal fusion. As the major contribution of our proposed model, the hierarchical knowledge distillation is designed to transfer knowledge from the dominant modality to the others at both the feature and label levels. It boosts the performance of the non-dominant modalities by modeling the inter-modal relations between modalities. We have validated the effectiveness of our proposed model on two benchmark datasets.
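To make the two distillation levels described in the abstract concrete, below is a minimal PyTorch sketch of how a feature-level loss (pulling non-dominant features toward the dominant modality's features) and a label-level loss (matching temperature-softened class distributions) might be combined with an attentive fusion head. All names, dimensions, temperatures, and loss weights here (e.g., `AttentiveFusion`, `TEMP`, the 0.5 weighting) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of feature- and label-level knowledge distillation
# with attentive fusion; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMP = 2.0  # softmax temperature for label-level distillation (assumed)

class AttentiveFusion(nn.Module):
    """Weights each modality's feature vector by a learned attention score."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                 # feats: (batch, n_modalities, dim)
        attn = torch.softmax(self.score(feats), dim=1)   # (batch, M, 1)
        return (attn * feats).sum(dim=1)                 # (batch, dim)

def feature_level_kd(student_feat, teacher_feat):
    """Pull a non-dominant modality's features toward the dominant one."""
    return F.mse_loss(student_feat, teacher_feat.detach())

def label_level_kd(student_logits, teacher_logits, t=TEMP):
    """KL divergence between temperature-softened class distributions."""
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

# Example: text is the dominant (teacher) modality; audio and image are the
# non-dominant (student) modalities. In practice these features would come
# from trainable per-modality encoders; random tensors stand in here.
batch, dim, n_classes = 8, 128, 7
text_feat  = torch.randn(batch, dim)
audio_feat = torch.randn(batch, dim)
image_feat = torch.randn(batch, dim)

classifier = nn.Linear(dim, n_classes)       # shared head, for illustration
fusion = AttentiveFusion(dim)

kd_loss = sum(
    feature_level_kd(f, text_feat)
    + label_level_kd(classifier(f), classifier(text_feat))
    for f in (audio_feat, image_feat)
)
fused = fusion(torch.stack([text_feat, audio_feat, image_feat], dim=1))
task_loss = F.cross_entropy(classifier(fused),
                            torch.randint(0, n_classes, (batch,)))
loss = task_loss + 0.5 * kd_loss             # 0.5 is an assumed trade-off weight
loss.backward()
```

Detaching the teacher's features and logits ensures the gradient only updates the non-dominant branches, which matches the abstract's goal of transferring knowledge from, rather than into, the dominant modality.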