Multi-Level Skeleton Self-Supervised Learning: Enhancing 3D action representation learning with Large Multimodal Models

Tian He, Yang Chen, Xu Gao, Ling Wang, Rui Huang, Hong Cheng

Published: 01 Jun 2025, Last Modified: 23 Nov 2025 · Knowledge-Based Systems · CC BY-SA 4.0
Abstract: Self-Supervised Learning (SSL) has proven effective for skeleton-based action understanding and is drawing increasing research attention. Previous studies mainly focus on capturing relationships between joints and across skeleton sequences through joint-level masked motion modeling and sequence-level contrastive learning. However, these methods overlook the subtle semantic connections between similar movements, leading to poor feature discrimination for such actions and degrading downstream task performance. In this paper, we propose a Multi-Level Skeleton Self-Supervised Learning (MLS³L) framework that integrates joint-, sequence-, and semantic-level SSL in a complementary manner for fine-grained action understanding. Specifically, we first design topology-based mask reconstruction for joint-level SSL and tempo-independent contrastive learning for sequence-level SSL. For semantic-level SSL, we leverage pre-trained Large Multimodal Models (LMMs) to generate discriminative text descriptions for action sequences and design a weighted soft alignment algorithm to align these descriptions with the corresponding skeletons. This semantic-level representation distillation significantly enhances the ability to distinguish between similar actions. Furthermore, we propose a multi-level collaboration strategy that enables the SSL tasks at different levels to jointly learn versatile representations at various granularities, leading to improved action representation features. Our method demonstrates exceptional performance on various downstream tasks, validated on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
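
The abstract does not give the exact formulation of the three objectives, so the following is only a minimal sketch of how such a multi-level loss could be combined, assuming masked-joint reconstruction is an MSE over masked positions, the sequence-level term is a temperature-scaled InfoNCE between two views, and the semantic-level term is a per-sample weighted alignment between skeleton and LMM-text embeddings. All function names (e.g., weighted_soft_alignment_loss, multi_level_loss), the lambda weights, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a multi-level SSL loss combining
# joint-, sequence-, and semantic-level objectives with a simple weighted sum.
import torch
import torch.nn.functional as F


def masked_joint_reconstruction_loss(pred_joints, target_joints, mask):
    """Joint-level SSL: MSE over masked joint positions only."""
    # pred_joints/target_joints: (B, T, J, C); mask: (B, T, J), 1 = masked
    diff = ((pred_joints - target_joints) ** 2).mean(dim=-1)  # avg over coords
    return (diff * mask).sum() / mask.sum().clamp(min=1)


def info_nce(query, key, temperature=0.07):
    """Sequence-level SSL: contrastive loss between two views of each sequence."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(key, dim=-1)
    logits = q @ k.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)


def weighted_soft_alignment_loss(skel_emb, text_emb, weights, temperature=0.07):
    """Semantic-level SSL (assumed form): softly align skeleton embeddings with
    LMM-generated text embeddings, weighting each sample's contribution."""
    s = F.normalize(skel_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.t() / temperature
    labels = torch.arange(s.size(0), device=s.device)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = weights / weights.sum().clamp(min=1e-8)
    return (weights * per_sample).sum()


def multi_level_loss(recon, target, mask, seq_v1, seq_v2, skel_emb, text_emb,
                     align_weights, lambdas=(1.0, 1.0, 1.0)):
    """Multi-level collaboration (illustrative): weighted sum of the three terms."""
    l_joint = masked_joint_reconstruction_loss(recon, target, mask)
    l_seq = info_nce(seq_v1, seq_v2)
    l_sem = weighted_soft_alignment_loss(skel_emb, text_emb, align_weights)
    return lambdas[0] * l_joint + lambdas[1] * l_seq + lambdas[2] * l_sem


if __name__ == "__main__":
    B, T, J, C, D = 4, 16, 25, 3, 128  # batch, frames, joints, coords, embed dim
    loss = multi_level_loss(
        recon=torch.randn(B, T, J, C), target=torch.randn(B, T, J, C),
        mask=(torch.rand(B, T, J) < 0.4).float(),
        seq_v1=torch.randn(B, D), seq_v2=torch.randn(B, D),
        skel_emb=torch.randn(B, D), text_emb=torch.randn(B, D),
        align_weights=torch.rand(B),
    )
    print(loss.item())
```

In practice the relative weights of the three terms (here the hypothetical lambdas) would be tuned so that no single level dominates pre-training.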