Keywords: Multimodal Emotion Recognition, Heterogeneous Graph Neural Network, Context-aware Graph Transformer, Graph-driven Multimodal Representation Learning
Abstract: Multimodal emotion recognition (MER) aims to infer human affect from verbal, vocal, and visual signals, a core challenge in representation learning for human–AI interaction. State-of-the-art approaches, including standard Transformers and graph-based models, often collapse modalities into uniform structures, ignoring modality-specific temporal dynamics and asymmetric dependencies. We propose a novel context-aware heterogeneous graph-driven representation learning framework that explicitly encodes both structural and semantic heterogeneity. Each modality is first contextualized with dedicated Transformer encoders, enriching unimodal features before graph construction. We then introduce a relation-aware graph transformer that performs type-conditioned message passing, enabling specialized transformations across sequential, cross-modal, and speaker-conditioned edges. The topology is adapted to the target regime: in multi-party dialogue (IEMOCAP, MELD), we distinguish within-speaker and cross-speaker temporal flows, while in single-speaker videos (CMU-MOSEI), we extend k-step temporal links to capture offset dynamics. In both settings, co-temporal edges synchronize audio, visual, and textual cues. Experiments demonstrate consistent gains over prior state-of-the-art methods, showing that structural and semantic heterogeneity are indispensable for robust multimodal representation learning. Our results establish that explicitly modeling interaction structure, rather than relying on generic sequence attention, is critical for advancing multimodal learning. To support reproducibility and further research, we will release our source code.
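To make the idea of type-conditioned message passing concrete, below is a minimal sketch of one relation-aware layer over a heterogeneous graph with typed edges (e.g., within-speaker temporal, cross-speaker temporal, co-temporal cross-modal). All class and edge-type names here are illustrative assumptions, not the authors' released implementation; the actual model uses a graph transformer rather than the simple gated update shown.

# Illustrative sketch (assumed names): one message-passing round with a
# separate learned transform per edge type, not the paper's exact architecture.
import torch
import torch.nn as nn

class RelationAwareGraphLayer(nn.Module):
    """Type-conditioned message passing: each relation gets its own projection."""
    def __init__(self, dim, edge_types):
        super().__init__()
        # Dedicated projection per relation (e.g., within-speaker temporal,
        # cross-speaker temporal, co-temporal cross-modal).
        self.rel_proj = nn.ModuleDict({t: nn.Linear(dim, dim) for t in edge_types})
        self.update = nn.GRUCell(dim, dim)  # gated node update (a common choice)

    def forward(self, x, edges):
        # x: [num_nodes, dim] node features (one node per modality per utterance).
        # edges: dict mapping edge type -> (src_idx, dst_idx) LongTensors.
        agg = torch.zeros_like(x)
        for etype, (src, dst) in edges.items():
            msg = self.rel_proj[etype](x[src])   # relation-specific transform
            agg.index_add_(0, dst, msg)          # sum incoming messages per node
        return self.update(agg, x)

# Toy usage: 4 nodes, two hypothetical edge types.
edge_types = ["within_speaker", "cross_modal"]
layer = RelationAwareGraphLayer(dim=16, edge_types=edge_types)
x = torch.randn(4, 16)
edges = {
    "within_speaker": (torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])),
    "cross_modal": (torch.tensor([0, 2]), torch.tensor([1, 3])),
}
out = layer(x, edges)  # [4, 16] updated node representations

The key design point illustrated is that messages arriving over different edge types are transformed by different parameters, so sequential, cross-modal, and speaker-conditioned dependencies are not forced through a single shared attention or projection.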
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 21069