Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion

Maxim Markitantov, Elena Ryumina, Heysem Kaya, Alexey Karpov

Published: 2025, Last Modified: 02 Mar 2026. INTERSPEECH 2025. CC BY-SA 4.0

Abstract: Despite recent advances in multi-modal approaches, recognizing the full range of human affective states, including emotions and sentiments, remains challenging because of complex interactions between modalities and the hierarchical nature of affective states. This work presents a novel approach to multi-modal multi-task emotion and sentiment recognition that integrates audio, video, and text data. We introduce a Label Encoder Fusion Strategy that produces and processes uni-modal emotion and sentiment predictions; these predictions are used alongside modality-specific features during fusion to provide additional contextual information. We conduct extensive multi-corpus experiments on the RAMAS, MELD, and CMU-MOSEI corpora. The proposed approach achieves state-of-the-art performance in both affective tasks. On MELD, we achieve a macro F1 (MF) of 40.9% for emotion recognition and 67.02% for sentiment recognition. On CMU-MOSEI, the mean MF is 62.30% and the MF is 62.00% for the same two tasks, respectively.
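The results above are reported as macro F1 (MF), the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function name `macro_f1` and the toy sentiment labels are illustrative assumptions and are not taken from RAMAS, MELD, or CMU-MOSEI.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    # Score every class that appears in either the references or the predictions.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
y_true = ["neg", "neg", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "neu", "neu", "pos", "pos", "neg"]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

In this toy case the per-class F1 scores are 0.5 (neg), 2/3 (neu), and 0.8 (pos), and macro F1 is their plain average. A class-frequency-weighted F1 would give a different value whenever the classes are imbalanced.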

External IDs: dblp:conf/interspeech/MarkitantovRK025