BReSK: Bootstrapped Contrastive Representation Learning for Skeleton-Based Action Understanding

Published: 11 May 2026, Last Modified: 11 May 2026
AERO-HPR 2026 Poster
License: CC BY 4.0
Track: Non-Proceedings Track
Keywords: Self-supervised learning, Contrastive learning, Skeleton-based action recognition
Abstract: Self-supervised learning, and contrastive learning in particular, has emerged as a powerful paradigm for skeleton-based action recognition. However, existing approaches often rely on heavy architectural refinements, such as transformer-based modules, which introduce redundancy and increase model complexity without necessarily improving representation consistency. To address this issue, we propose BReSK, a self-supervised framework that combines bootstrap prediction with momentum contrastive learning for skeleton-based action understanding. At the core of BReSK is DiP, an asymmetric dual-branch predictor that enforces cross-view consistency through spatial and temporal predictors in the Query branch, while using exponential moving average (EMA) targets in the Key branch to stabilize representation learning. In addition, we introduce BoCL, a hybrid objective that jointly optimizes a bootstrap alignment loss ($\mathcal{L}_{\text{CroP}}$) and a momentum-based contrastive loss ($\mathcal{L}_{\text{MiCo}}$), improving instance discrimination while reducing class confusion in the embedding space. Extensive experiments on six benchmark datasets (NTU-RGB+D 60, NTU-RGB+D 120, PKU-MMD, Toyota SmartHome, Penn Action, and Posetics) show that BReSK consistently outperforms state-of-the-art methods across diverse settings while using fewer parameters.
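The abstract describes BoCL as a bootstrap alignment term against EMA targets combined with a momentum-based contrastive term. The sketch below illustrates one plausible reading of such a hybrid objective, assuming a BYOL-style negative-cosine prediction loss for $\mathcal{L}_{\text{CroP}}$, a MoCo-style InfoNCE loss with a negative queue for $\mathcal{L}_{\text{MiCo}}$, and an illustrative weighting `lam`; the function names and hyperparameters are assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a BoCL-style hybrid objective (assumed formulation):
# L_BoCL = L_CroP (bootstrap alignment) + lam * L_MiCo (momentum contrastive).
import torch
import torch.nn.functional as F

def ema_update(key_encoder, query_encoder, momentum=0.999):
    """Exponential moving average update of the Key (target) branch parameters."""
    for k_param, q_param in zip(key_encoder.parameters(), query_encoder.parameters()):
        k_param.data.mul_(momentum).add_(q_param.data, alpha=1.0 - momentum)

def crop_loss(pred, target):
    """Bootstrap alignment: negative cosine similarity against the stop-gradient EMA target."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target.detach(), dim=-1)
    return 2.0 - 2.0 * (pred * target).sum(dim=-1).mean()

def mico_loss(query, key, queue, temperature=0.07):
    """Momentum contrastive InfoNCE: EMA key is the positive, queue entries are negatives."""
    query = F.normalize(query, dim=-1)
    key = F.normalize(key.detach(), dim=-1)
    l_pos = (query * key).sum(dim=-1, keepdim=True)           # (N, 1) positive logits
    l_neg = query @ F.normalize(queue, dim=-1).t()            # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)    # positives sit at index 0
    return F.cross_entropy(logits, labels)

def bocl_loss(pred, query, key, queue, lam=1.0):
    """Hybrid objective; the weighting `lam` is an illustrative assumption."""
    return crop_loss(pred, key) + lam * mico_loss(query, key, queue)
```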
Supplementary Material: pdf
Submission Number: 12