Multimodal Emotion Recognition with Temporal Slicing Encoder and Attention-Enhanced Synergy Integration

Haoyu Wang, Bengong Yu, Zhonghao Xi, Shuping Zhao, Ying Yang

Published: 2026, Last Modified: 26 May 2026, IEEE Trans. Multim. 2026, CC BY-SA 4.0

Abstract: For emotion recognition on continuous temporal sequence data, prior work has explored a range of effective integration strategies from multiple perspectives and reported strong results. Most of these studies rely on Long Short-Term Memory (LSTM) networks to extract features from both video and audio, and often neglect thorough extraction of the underlying features before integration. We avoid convolutional and recurrent architectures entirely and instead design a simple, stackable Temporal Slicing Encoder (TSE) to distill temporal characteristics. Results on two sentiment analysis datasets show that the TSE module is effective at extracting emotional features. Building on this foundation, we further explore modality interaction, addressing cross-modal data activation and synergy optimization between different features. To this end we devise the Deep Bimodal Information Transfer Module (DBIT) and the Dynamic Synergy Optimization Network (DSON), which together with the TSE module form our TASE-Net (Temporal Attention Synergy Emotion Network). The DBIT module establishes a cross-attention mechanism guided by mutual information to facilitate text-guided cross-modal data activation, while the DSON module achieves adaptive emotional feature confidence allocation and knowledge transfer between trimodal and unimodal features through an Emotional Weight Adjuster (EWA) and an Asymmetric Bidirectional Distillator (ABD). Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets support the effectiveness of TASE-Net and the TSE encoder.
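
The abstract describes text-guided cross-attention only at a high level. As a rough illustration of the general idea, the sketch below implements plain scaled dot-product cross-attention in which text vectors act as queries and another modality (here called audio) supplies keys and values. This is not the authors' DBIT module: the mutual-information guidance is omitted, there are no learned projections, and the function names, shapes, and toy inputs are all assumptions made for the example.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: text-side vectors, shape (T_q, d)
    keys, values: vectors from another modality, shapes (T_k, d) and (T_k, d_v)
    Returns the attended outputs (T_q, d_v) and the attention weights (T_q, T_k).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this text step to every step of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the other modality's values.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Hypothetical toy inputs: 2 text steps and 3 audio steps, feature size 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = cross_attention(text, audio, audio)
```

Each row of `attn` is a distribution over audio steps for one text step, so text steps that align with particular audio steps draw more of their output from those steps. A practical model would add learned query, key, and value projections and multiple heads.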

External IDs: dblp:journals/tmm/WangYXZY26