Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation
Abstract: Multimodal emotion recognition in conversation (MERC) seeks to identify the emotion a speaker expresses in each utterance, with applications across diverse fields. The challenge of MERC lies in balancing speaker modeling against context modeling, covering both long-distance and short-distance contexts, and in handling the complexity of multimodal information fusion. Recent research adopts graph-based methods to model intricate conversational relationships. However, most of these methods use a fixed, fully connected structure to link all utterances and rely on convolution to interpret complex context. This design can increase redundancy in contextual messages and cause excessive graph over-smoothing, particularly in long conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections using a variational hypergraph autoencoder (VHGAE) and employs contrastive learning to mitigate uncertainty introduced during the reconstruction process. Experimental results demonstrate the effectiveness of our proposal against state-of-the-art methods on the IEMOCAP and MELD datasets. We release the code to support the reproducibility of this work (it is currently uploaded as the "complementary material" within the review system and will be made public once the review process is complete).
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Engagement] Emotional and Social Signals
Relevance To Conference: Our work contributes to multimedia/multimodal processing by introducing a novel framework that leverages hypergraph learning for multimodal emotion recognition in conversation (MERC). Specifically, the framework uses text, audio, and visual cues to capture the complex interactions between modalities when interpreting the emotions speakers express during conversations. It jointly optimizes hypergraph reconstruction, contrastive learning, and emotion recognition. Contrastive learning is introduced to mitigate the uncertainty arising from the sampling process and the Gumbel-softmax learning process of the variational hypergraph autoencoder, improving the robustness and stability of the model. Extensive experiments validate the effectiveness of our proposal. This research is expected to pave the way for more accurate emotion recognition systems in applications such as human-computer interaction and sentiment analysis.
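To make the Gumbel-softmax step mentioned above more concrete, the following is a minimal sketch of how a relaxed, differentiable-style sample of hypergraph connections could be drawn from membership logits. It is not the submission's implementation: the function names `gumbel_softmax` and `sample_incidence`, the temperature value, and the layout of the logits (one row per utterance node, one column per candidate hyperedge) are all assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature=0.5, rng=random):
    """Return a relaxed one-hot sample (a probability vector) over the choices.

    Hypothetical helper: the standard Gumbel-softmax relaxation, not code
    taken from the submission.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_incidence(membership_logits, temperature=0.5, seed=0):
    """Sample one relaxed assignment row per utterance node.

    Assumed layout: membership_logits[i][j] scores how strongly utterance i
    belongs to candidate hyperedge j.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [gumbel_softmax(row, temperature, rng) for row in membership_logits]


if __name__ == "__main__":
    logits = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]]
    for row in sample_incidence(logits):
        print([round(value, 3) for value in row])
```

Lower temperatures push each row closer to a hard one-hot choice, while higher temperatures yield smoother assignments. Because every draw adds random noise, repeated samples differ, which is the kind of sampling uncertainty the relevance statement says contrastive learning is meant to mitigate.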
Supplementary Material:  zip
Submission Number: 5177