Abstract: Emotion recognition in conversations (ERC) has garnered significant attention for its critical role in human-computer interaction systems. ERC benefits from multimodal data, which offers diverse perspectives on emotional states, and from commonsense knowledge (CSK), which enriches the context with real-world understanding of human behavior. However, existing ERC studies have not fully exploited multimodal-CSK interactions to learn the complementary information these two sources provide. To address this, we propose a novel Knowledge-Aware Multimodal Interaction Network (KA-MIN), designed to capture complementary emotional information from CSK-multimodal interactions and thereby facilitate the ERC task. KA-MIN first combines the six relation types of CSK, leveraging their differences with respect to the multimodal emotional information. The fused CSK features are then refined with multimodal contextual guidance so that they incorporate contextual and emotional information. Subsequently, we construct a novel knowledge-aware multimodal graph structure that allows the CSK information to interact with the multimodal information, leading to more comprehensive multimodal and context modeling. During graph learning, these CSK-multimodal interactions capture the complementary emotional information between the CSK and multimodal features. Finally, we dynamically fuse the multimodal emotional information with the informative CSK and textual guidance to obtain the final utterance representations, which encompass effective emotional information from both the multimodal and CSK features. Extensive experiments on two popular multimodal ERC datasets demonstrate the superiority and effectiveness of the proposed KA-MIN framework.
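To make the pipeline described above concrete, below is a minimal PyTorch-style sketch of the main stages (CSK relation fusion, context-guided refinement, graph-based CSK-multimodal interaction, and dynamic fusion). The module names, dimensions, the simple attention and GCN-style layers, and the gated fusion are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of the KA-MIN stages summarized in the abstract.
    # Names, sizes, and layer choices are hypothetical stand-ins.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KAMINSketch(nn.Module):
        def __init__(self, dim=256, num_csk_relations=6, num_emotions=7):
            super().__init__()
            # Stage 1: fuse the six CSK relation types into one knowledge vector per utterance.
            self.csk_fuse = nn.Linear(num_csk_relations * dim, dim)
            # Stage 2: refine fused CSK with multimodal contextual guidance (cross-attention).
            self.refine = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            # Stages 3-4: knowledge-aware multimodal graph; a single GCN-style layer over a
            # fully connected utterance/knowledge graph stands in for the real structure.
            self.graph_proj = nn.Linear(dim, dim)
            # Stage 5: dynamic (gated) fusion of multimodal features with CSK and textual guidance.
            self.gate = nn.Linear(3 * dim, dim)
            self.classifier = nn.Linear(dim, num_emotions)

        def forward(self, text, audio, visual, csk):
            # text/audio/visual: (batch, seq_len, dim); csk: (batch, seq_len, 6, dim)
            b, n, _ = text.shape
            multimodal = (text + audio + visual) / 3                 # simple modality average
            knowledge = self.csk_fuse(csk.flatten(2))                # fused CSK, (b, n, dim)
            # CSK refined by multimodal context (queries = knowledge, keys/values = multimodal).
            knowledge, _ = self.refine(knowledge, multimodal, multimodal)
            # Graph interaction: message passing over utterance and knowledge nodes.
            nodes = torch.cat([multimodal, knowledge], dim=1)        # (b, 2n, dim)
            adj = torch.softmax(nodes @ nodes.transpose(1, 2), dim=-1)
            nodes = F.relu(self.graph_proj(adj @ nodes))
            mm_out, k_out = nodes[:, :n], nodes[:, n:]
            # Dynamic fusion with textual guidance.
            g = torch.sigmoid(self.gate(torch.cat([mm_out, k_out, text], dim=-1)))
            utterance = g * mm_out + (1 - g) * k_out                 # final utterance representation
            return self.classifier(utterance)                        # per-utterance emotion logits

The gated combination at the end is one plausible reading of "dynamically fuse"; the actual graph construction and fusion mechanism in KA-MIN may differ.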
External IDs: dblp:journals/tcsv/RenHLGSL25