DQ-Former: Querying Transformer with Dynamic Modality Priority for Cognitive-aligned Multimodal Emotion Recognition in Conversation
Abstract: Multimodal Emotion Recognition in Conversations aims to identify the emotion expressed in each utterance of a conversation from multiple types of data, such as speech and text.
Previous works mainly focus on either complex unimodal feature extraction or sophisticated fusion techniques, treating the task as a generic multimodal classification problem.
However, they overlook how humans perceive emotion, neglecting the different levels of emotional features within each modality and disregarding the distinct contribution each modality makes to emotion recognition.
To address these issues, we propose a more cognitively aligned multimodal fusion framework, DQ-Former.
Specifically, DQ-Former utilizes a small set of learnable query tokens to collate and condense various granularities of emotion cues embedded at different layers of pre-trained unimodal models. Subsequently, it integrates these emotional features from different modalities with dynamic modality priorities at each intermediate fusion layer. This process enables explicit and effective fusion of different levels of information from diverse modalities.
Extensive experiments on the MELD and IEMOCAP datasets validate the effectiveness of DQ-Former. Our results show that the proposed method yields robust and interpretable multimodal representations for emotion recognition.
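For illustration, the PyTorch sketch below shows one way a small set of learnable query tokens could collate and condense emotion cues from an intermediate layer of a frozen unimodal encoder via cross-attention, as described in the abstract. The module names, dimensions, and residual/normalization choices are assumptions made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QueryCollator(nn.Module):
    """Illustrative sketch: a small set of learnable query tokens cross-attends
    to the hidden states of one intermediate layer of a frozen unimodal encoder,
    condensing that layer's emotion cues into a fixed-length representation.
    Names and dimensions are assumptions, not the authors' implementation."""

    def __init__(self, num_queries: int = 8, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, layer_hidden: torch.Tensor) -> torch.Tensor:
        # layer_hidden: (batch, seq_len, dim) hidden states of one encoder layer
        b = layer_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)       # (batch, num_queries, dim)
        collated, _ = self.cross_attn(q, layer_hidden, layer_hidden)
        return self.norm(collated + q)                        # (batch, num_queries, dim)


# Usage sketch: collate cues from several stand-in intermediate encoder layers.
if __name__ == "__main__":
    hidden_states = [torch.randn(2, 50, 768) for _ in range(3)]
    collator = QueryCollator()
    per_layer = [collator(h) for h in hidden_states]          # one summary per layer
    print(per_layer[0].shape)                                 # torch.Size([2, 8, 768])
```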
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Engagement] Emotional and Social Signals
Relevance To Conference: This work advances the field of multimodal processing by introducing DQ-Former, a novel framework tailored to multimodal emotion recognition in conversations. Unlike previous approaches that focus on either unimodal feature extraction or sophisticated fusion techniques, DQ-Former integrates insights from cognitive science to prioritize and consolidate emotional cues from multiple modalities effectively. The framework uses learnable query tokens to collate and condense emotion cues from the intermediate layers of both the textual and acoustic encoders, allowing for a more nuanced understanding of emotional attributes. By incorporating a dynamic modality priority, it ensures that each modality's contribution is appropriately weighted in the final representation, enhancing both robustness and interpretability. The proposed approach not only outperforms previous methods on benchmark datasets but also sheds light on the importance of considering diverse emotional attributes and modalities in multimodal processing tasks.
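As a minimal sketch of the dynamic modality priority idea mentioned above, the snippet below assigns per-sample softmax weights to each modality's condensed query features before mixing them. The gating function and pooling choices are assumed for illustration and are not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicModalityGate(nn.Module):
    """Illustrative sketch of per-sample modality weighting: a small gate scores
    each modality's condensed query features and mixes them with softmax weights,
    so the more informative modality dominates the fused representation.
    This is an assumed mechanism for illustration, not the authors' exact design."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats: list[torch.Tensor]) -> torch.Tensor:
        # modality_feats: list of (batch, num_queries, dim), one tensor per modality
        pooled = torch.stack([f.mean(dim=1) for f in modality_feats], dim=1)  # (batch, M, dim)
        weights = F.softmax(self.score(pooled), dim=1)                        # (batch, M, 1)
        stacked = torch.stack(modality_feats, dim=1)                          # (batch, M, Q, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                   # (batch, Q, dim)


# Usage sketch: fuse text and audio query summaries with dynamic priorities.
if __name__ == "__main__":
    text_feats, audio_feats = torch.randn(2, 8, 768), torch.randn(2, 8, 768)
    gate = DynamicModalityGate()
    fused = gate([text_feats, audio_feats])
    print(fused.shape)  # torch.Size([2, 8, 768])
```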
Supplementary Material: zip
Submission Number: 4939