Abstract: CLIP, widely used in multimodal learning, excels due to its large-scale image-text pretraining. However, applying CLIP-like architectures to skeleton-based action representation learning is challenging: skeleton data is non-visual and structurally incompatible with CLIP's image inputs, and skeleton datasets are limited in scale, both of which hinder robust generalization. To address these issues, we propose SKL-CLIP, a framework that incorporates Supervised Self-Contrastive Learning to mitigate overfitting and enhance transferable representation learning, Knowledge Distillation from the textual encoder of pretrained CLIP models to preserve generalization while adapting to skeleton-text scenarios, and Multi-Domain Parallel Training to leverage diverse support datasets and improve cross-dataset and zero-shot recognition. Extensive experiments on the NTU and PKU datasets demonstrate that SKL-CLIP significantly advances skeleton-based action representation, achieving state-of-the-art performance across fully-supervised, cross-dataset unsupervised domain adaptation, and zero-shot tasks.
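The abstract does not specify the exact loss formulation, but the two learning signals it names (supervised contrastive learning over skeleton features and distillation from a frozen CLIP-style text encoder) can be illustrated with a minimal PyTorch sketch. All function names, temperatures, and the loss weighting below are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch (assumed formulation): a skeleton encoder is trained with
# (1) a SupCon-style supervised contrastive term over skeleton embeddings and
# (2) an alignment/distillation term pulling skeleton features toward the
#     frozen text-encoder embeddings of their class labels.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(feats, labels, temperature=0.07):
    """SupCon-style loss: same-class samples in the batch act as positives."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                        # (B, B)
    mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()   # positive pairs
    mask.fill_diagonal_(0)                                       # drop self-pairs
    # exclude self-similarity from the softmax denominator
    logits = sim - torch.eye(len(feats), device=feats.device) * 1e9
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask.sum(1).clamp(min=1)
    return -(mask * log_prob).sum(1).div(pos_count).mean()

def text_distillation_loss(skel_feats, text_class_embeds, labels, temperature=0.07):
    """Classify skeleton features against frozen CLIP text embeddings of the labels."""
    skel_feats = F.normalize(skel_feats, dim=-1)                 # (B, D)
    text_class_embeds = F.normalize(text_class_embeds, dim=-1)   # (C, D), frozen
    logits = skel_feats @ text_class_embeds.t() / temperature    # (B, C)
    return F.cross_entropy(logits, labels)

def training_step(skeleton_encoder, text_class_embeds, skeletons, labels, kd_weight=0.5):
    """One combined step; the 0.5 weighting is an assumption for illustration."""
    feats = skeleton_encoder(skeletons)                          # (B, D)
    loss_con = supervised_contrastive_loss(feats, labels)
    loss_kd = text_distillation_loss(feats, text_class_embeds, labels)
    return loss_con + kd_weight * loss_kd
```

In this reading, the contrastive term regularizes the skeleton encoder against overfitting on small datasets, while the distillation term keeps its feature space anchored to the pretrained CLIP text space; how the paper actually weights or schedules these terms, and how Multi-Domain Parallel Training batches the support datasets, is not stated in the abstract.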
External IDs: dblp:conf/icmcs/WangCGLL25