Abstract: Recognizing human emotions from behavioral and physiological signals is fundamental to assessing overall health. However, because emotions occur transiently, a semantic mismatch exists between the uniformly annotated label and the multimodal temporal signals, leading to emotional ambiguity. Previous studies use the annotated label to guide feature learning, which makes it intractable to accurately identify the emotion elicitation moments within each signal. Moreover, the inconsistency of these elicitation moments across different signals further complicates emotion recognition. Owing to this emotional ambiguity, a model can hardly learn discriminative features, which weakens its ability to differentiate between emotions. To tackle these challenges, this paper proposes a novel label feature co-learning model (LFCL) for emotion recognition from multimodal signals. Specifically, the LFCL leverages both unimodal and multimodal information and adaptively generates instance-level emotion labels, thereby precisely locating the emotion elicitation moments within each signal. To promote emotion consistency across different signals, the LFCL incorporates a dynamic label calibration mechanism that balances the label generation process with historical information. Furthermore, to enhance deep interaction between signals, the LFCL performs multimodal interactive fusion to integrate multi-level multimodal features and extract global emotional information. The LFCL then performs precise label-to-feature alignment to capture discriminative features from each signal, effectively alleviating emotional ambiguity and improving the ability to distinguish different emotions. Extensive experiments on three publicly available datasets demonstrate the effectiveness and generalization of the proposed model.
External IDs: dblp:journals/taffco/CaiCLZ25
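To make the abstract's pipeline concrete, here is a minimal sketch, assuming a PyTorch implementation with two modalities; the GRU encoders, the cross-attention used as a stand-in for multimodal interactive fusion, the EMA buffer used as a stand-in for dynamic label calibration, and all module names and shapes are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical LFCL-style sketch: instance-level label generation per modality,
# history-based label calibration, and multimodal fusion for a global prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalEncoder(nn.Module):
    """Encodes one temporal signal into per-instance (per-time-step) features."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True)

    def forward(self, x):                 # x: (batch, time, in_dim)
        h, _ = self.gru(x)                # h: (batch, time, hid_dim)
        return h


class LFCLSketch(nn.Module):
    """Illustrative co-learning of instance-level labels and fused features."""

    def __init__(self, dims, hid_dim, num_classes, momentum=0.9):
        super().__init__()
        self.encoders = nn.ModuleList(UnimodalEncoder(d, hid_dim) for d in dims)
        self.inst_heads = nn.ModuleList(nn.Linear(hid_dim, num_classes) for _ in dims)
        # Cross-attention as a stand-in for "multimodal interactive fusion".
        self.fusion = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        self.global_head = nn.Linear(hid_dim, num_classes)
        self.momentum = momentum
        # Historical label distribution used to calibrate generated labels (EMA).
        self.register_buffer("hist_dist", torch.full((num_classes,), 1.0 / num_classes))

    def forward(self, signals):
        # signals: list of tensors, one per modality, each (batch, time, dim_m).
        feats = [enc(x) for enc, x in zip(self.encoders, signals)]

        # Adaptive instance-level soft labels per modality, blended with history.
        inst_labels = []
        for head, h in zip(self.inst_heads, feats):
            p = F.softmax(head(h), dim=-1)                        # (batch, time, C)
            p = self.momentum * p + (1 - self.momentum) * self.hist_dist
            inst_labels.append(p)

        # Update the historical distribution from the current batch (EMA).
        if self.training:
            with torch.no_grad():
                batch_mean = torch.stack(inst_labels).mean(dim=(0, 1, 2))  # (C,)
                self.hist_dist.mul_(self.momentum).add_((1 - self.momentum) * batch_mean)

        # Fuse the two modalities (assumed two for brevity) and predict globally.
        fused, _ = self.fusion(feats[0], feats[1], feats[1])      # (batch, time, hid)
        logits = self.global_head(fused.mean(dim=1))              # (batch, C)
        return logits, inst_labels


if __name__ == "__main__":
    model = LFCLSketch(dims=[32, 48], hid_dim=64, num_classes=4)
    eeg = torch.randn(8, 50, 32)      # hypothetical physiological signal
    face = torch.randn(8, 50, 48)     # hypothetical behavioral signal
    logits, inst_labels = model([eeg, face])
    print(logits.shape, inst_labels[0].shape)  # (8, 4) and (8, 50, 4)
```

In the paper, the instance-level labels would additionally be aligned with per-signal features to supervise discriminative feature learning; the sketch above only illustrates how the three described components could connect at the tensor level.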