A Multi-stage and Multi-target Knowledge Distillation Framework for Multimodal Conversational Emotion Recognition
Abstract: In Emotion Recognition in Conversations (ERC), one-hot labels are typically used as ground truth, but they may not fully capture all emotions conveyed in an utterance. Recent work in textual ERC has investigated self-distillation techniques that generate soft labels from single instances, aiming to improve emotional understanding. However, these approaches still struggle to fully capture complex emotional expressions. In multimodal ERC (MERC), generating soft labels is even more challenging because multiple modalities must be integrated, and each modality may express a distinct emotion. Motivated by these challenges, we propose a Multi-stage and Multi-target Knowledge Distillation Framework consisting of two components: Multi-stage Distillation (MSD) and Multi-target Distillation (MTD). MSD performs multi-stage self-distillation of soft labels and utterance representations, encouraging the MERC model to refine its label predictions across stages. Building on MSD, MTD further distills soft labels and features from a feature extractor that extracts modality-common and modality-specific features, deepening the model's understanding of emotions in multimodal scenarios. Experimental results on two datasets show that our framework significantly improves the performance of various MERC models, surpassing state-of-the-art methods.
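The soft-label distillation underlying MSD can be illustrated with the standard knowledge-distillation objective: a weighted blend of hard-label cross-entropy and a temperature-scaled KL divergence between teacher and student distributions. The sketch below is a minimal generic implementation, not the authors' exact loss; the function names, the temperature `T`, and the mixing weight `alpha` are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Generic soft-label distillation loss (illustrative, not the paper's).

    alpha * KL(teacher || student, at temperature T) * T^2
      + (1 - alpha) * cross-entropy with the one-hot label.
    """
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # KL term, scaled by T^2 as is conventional in distillation
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))) * T * T
    # Hard-label cross-entropy at temperature 1
    ce = float(-np.log(softmax(student_logits)[hard_label] + 1e-12))
    return alpha * kl + (1 - alpha) * ce
```

In a multi-stage setup, each stage's predictions would serve as the teacher distribution for the next, so the student's soft labels are progressively refined rather than fixed to the original one-hot targets.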
External IDs: dblp:conf/icmcs/NiuTWQX25