Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion

27 Sept 2024 (modified: 20 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Emotion Recognition in Conversations, Multimodal Representation, Mixture of Experts, Knowledge Distillation
Abstract: Multimodal Emotion Recognition in Conversations (MERC) seeks to identify emotional states across multiple modalities, including text, audio, and video. The task is pivotal for advancing machine intelligence, with significant implications for applications such as intelligent dialogue systems and public opinion analysis. Most existing approaches rely on full-sequence interaction and distillation techniques, aiming to build a comprehensive global contextual understanding while enhancing interaction among heterogeneous modalities. However, repetitive and redundant information, coupled with gradient conflicts arising from modal heterogeneity, can significantly impede multimodal learning and long-range relationship modeling. In this work, we propose SUMMER, a heterogeneous multimodal integration framework grounded in attention mechanisms and knowledge distillation, which enables dynamic, interactive fusion of multimodal representations. Specifically, a Sparse Dynamic Mixture of Experts strategy dynamically adjusts the relevance of temporal information to construct local-to-global token-wise interactions, and a Global Mixture of Experts then strengthens the model's overall contextual understanding across modalities. Notably, we introduce retrograde distillation, in which a pre-trained unimodal teacher model guides the learning of the multimodal student model, intervening in and supervising multimodal fusion in both the latent and logit spaces. Experiments on the IEMOCAP and MELD datasets demonstrate that SUMMER consistently outperforms existing state-of-the-art methods, with particularly significant improvements in recognizing minority and semantically similar emotions in MERC tasks.
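To make the two mechanisms named in the abstract concrete, the sketches below show one plausible reading of each; they are minimal PyTorch illustrations, not the authors' implementation, and all module names, expert counts, routing rules, and loss weights are assumptions for illustration only.

First, a token-wise sparse mixture of experts with top-k gating, one common way to realize the "dynamically adjust the relevance of temporal information" behavior the abstract describes:

import torch

class SparseMoE(torch.nn.Module):
    """Illustrative token-wise top-k mixture of experts (assumed design)."""
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        # Each expert is a small feed-forward block (an assumption).
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim),
                                torch.nn.GELU(),
                                torch.nn.Linear(dim, dim))
            for _ in range(num_experts))
        self.gate = torch.nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = self.gate(x)                   # (batch, seq, num_experts)
        topk, idx = scores.topk(self.k, dim=-1) # route each token to k experts
        weights = torch.softmax(topk, dim=-1)   # renormalize over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

Second, a generic teacher-student objective matching the abstract's description of retrograde distillation, with supervision in both the latent space (feature matching against the frozen unimodal teacher) and the logit space (temperature-scaled KL). The loss weights alpha, beta and the temperature are illustrative assumptions:

import torch.nn.functional as F

def retrograde_distillation_loss(student_feat, student_logits,
                                 teacher_feat, teacher_logits,
                                 labels, temperature=2.0, alpha=0.5, beta=0.5):
    """Task loss plus latent- and logit-space distillation (sketch only)."""
    # Supervised task loss on the multimodal student's own predictions.
    task_loss = F.cross_entropy(student_logits, labels)
    # Latent-space supervision: align student fusion features with the
    # detached representation of the pre-trained unimodal teacher.
    latent_loss = F.mse_loss(student_feat, teacher_feat.detach())
    # Logit-space supervision: soften both distributions with a temperature
    # and match them via KL divergence, scaled by T^2 as is standard.
    t = temperature
    logit_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits.detach() / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return task_loss + alpha * latent_loss + beta * logit_loss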
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10501