Abstract: Self-supervised emotion recognition from skeleton-based data offers a promising approach to classifying emotional expressions within the vast amounts of unlabeled data gathered by Internet of Things (IoT) sensors. Recent advances in this field have been driven by contrastive and generative self-supervised methods, which effectively address the problem of sparsely labeled data. In emotion recognition tasks, the high-level emotional semantics embedded in skeleton data matter more than subtle joint movements. Compared with existing methods, discrete label prediction encourages self-supervised models to abstract high-level semantics in a manner similar to human perception. However, it is challenging to comprehensively capture the emotions expressed in skeleton data from joint-based features alone. Moreover, emotional information conveyed through body movements may include redundant details that hinder the understanding of emotional expression. To overcome these challenges, we propose a novel discrete-label-based emotion recognition framework, the appendage-informed redundancy-ignoring (AIR) discrete label framework. First, we introduce the appendage-skeleton partitioning (ASP) module, which leverages limb movement data from the original skeleton to explore emotional expression. Next, we propose the appendage-refined multiscale discrete label (AMDL) module, which transforms traditional self-supervised tasks into classification tasks; this design continually extracts emotional semantics from skeleton data during pretraining, much like first predicting categories and then classifying samples into them. To further reduce nonessential information in skeleton data that may hinder the generation of accurate emotional categories, we propose the appendage label refinement (ALR) module, which refines the generated categories using the relationships between the skeleton and the appendages obtained via the ASP module. Finally, to maintain consistency across scales, we introduce the multigranularity appendage alignment (MGAA) method. By incorporating features from both coarse and fine scales, MGAA mitigates the encoder's sensitivity to noise and enhances its robustness. We evaluate our approach on the Emilya, EGBM, and KDAE datasets, where it consistently outperforms state-of-the-art methods under various evaluation protocols.
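To make the "predict discrete categories, then classify samples into them" pretraining idea concrete, the following is a minimal, hypothetical Python sketch; the encoder, codebook size, and all names are illustrative assumptions, not the authors' actual implementation of AMDL. It assigns each skeleton sequence a discrete pseudo-label via its nearest codebook prototype and trains a classifier against that assignment.

```python
# Hypothetical sketch of discrete-label self-supervised pretraining
# (illustrative only; not the paper's actual AMDL module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteLabelPretrainer(nn.Module):
    def __init__(self, in_dim=75, feat_dim=128, num_pseudo_classes=256):
        super().__init__()
        # Placeholder encoder for skeleton sequences of shape (B, T, J*3).
        self.encoder = nn.GRU(input_size=in_dim, hidden_size=feat_dim,
                              batch_first=True)
        # Learnable codebook: one prototype per discrete pseudo-label.
        self.prototypes = nn.Parameter(torch.randn(num_pseudo_classes, feat_dim))
        # Classification head trained to predict the assigned pseudo-label.
        self.classifier = nn.Linear(feat_dim, num_pseudo_classes)

    def forward(self, x):
        _, h = self.encoder(x)                 # h: (1, B, feat_dim)
        z = F.normalize(h.squeeze(0), dim=-1)  # per-sequence embedding
        # Assign each sample to its nearest prototype -> discrete label.
        sims = z @ F.normalize(self.prototypes, dim=-1).t()
        pseudo_labels = sims.argmax(dim=-1).detach()  # stop-gradient targets
        logits = self.classifier(z)
        # Cross-entropy against the discrete assignments: the model first
        # "predicts categories", then learns to classify samples into them.
        return F.cross_entropy(logits, pseudo_labels)

# Usage: 8 sequences, 50 frames, 25 joints x 3 coordinates.
model = DiscreteLabelPretrainer()
loss = model(torch.randn(8, 50, 75))
loss.backward()
```

A real discrete-label method would also update the codebook and balance assignments (e.g., via EMA updates or optimal-transport reassignment) to avoid collapse; those details are omitted here for brevity.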