Robust 2D Skeleton Action Recognition via Decoupling and Distilling 3D Latent Features

Published: 15 Apr 2025, Last Modified: 12 Nov 2025 | IEEE Transactions on Circuits and Systems for Video Technology | Everyone | CC BY 4.0
Abstract: Human skeletons provide a compact representation for action recognition. Compared to 3D skeletons, 2D skeletons lack view-independence and depth, making them less robust for motion analysis. However, 3D skeleton data requires specialized hardware, limiting its practicality, especially in outdoor or dynamic settings. In contrast, 2D skeletons can be extracted from standard RGB videos, making them more accessible. To bridge this gap, we propose 2D³-SkelAct, a 2D skeleton-based action recognition model. It maps 2D inputs to a 3D latent space, where pose and view features are decoupled. Additionally, 2D³-SkelAct distills motion cues from 3D models, enhancing its capture of motion detail while keeping the benefits of 2D data. Specifically, the pipeline of 2D³-SkelAct consists of two steps: pose-view decoupling and pose-view distilling. First, we use a spatiotemporal transformer to decouple 2D skeleton sequences into latent pose and view features, enhancing the model's ability to learn motion dynamics. Next, these decoupled features are separately integrated into the 2D skeleton model through two cross-attention modules, allowing it to extract discriminative motion features while mitigating uncertainties in 3D viewpoint and depth. In addition, we distill motion cues from 3D models to compensate for the limitations of 2D skeletons. Notably, our model can be seamlessly integrated with various skeleton feature extractors. We validate the proposed 2D³-SkelAct through extensive experiments, demonstrating its adaptability across different model architectures as well as consistent improvements. When combined with advanced skeleton feature extractors, 2D³-SkelAct achieves state-of-the-art performance in 2D skeleton-based action recognition.
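The abstract states that latent pose and view features are fed into the 2D skeleton model through two cross-attention modules, and that motion cues are distilled from 3D models. As a rough illustration only, the sketch below shows one way such a fusion could look. It is not the authors' implementation: the function names (`cross_attention`, `fuse`, `distill_loss`), the absence of learned projections, the residual sum used to combine the two attention outputs, and the mean-squared-error distillation loss are all assumptions made for this example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned Q/K/V projections, to keep the sketch minimal.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(skel_feats, pose_latent, view_latent):
    # Two separate cross-attention passes: 2D skeleton tokens act as queries,
    # latent pose and latent view tokens act as keys and values.
    # Assumption: the two contexts are combined by a plain residual sum.
    pose_ctx = cross_attention(skel_feats, pose_latent, pose_latent)
    view_ctx = cross_attention(skel_feats, view_latent, view_latent)
    return [[s + p + v for s, p, v in zip(s_row, p_row, v_row)]
            for s_row, p_row, v_row in zip(skel_feats, pose_ctx, view_ctx)]

def distill_loss(student, teacher):
    # Assumption: feature-level distillation as mean squared error between
    # 2D (student) features and 3D (teacher) features of the same shape.
    diffs = [(s - t) ** 2 for s_row, t_row in zip(student, teacher)
             for s, t in zip(s_row, t_row)]
    return sum(diffs) / len(diffs)

# Toy example: 2 skeleton tokens, 3 latent pose tokens, 1 latent view token.
skel = [[1.0, 0.0], [0.0, 1.0]]
pose = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view = [[0.5, 0.5]]
fused = fuse(skel, pose, view)
```

In this toy setup `fused` has the same shape as `skel`, so it could be passed to any downstream skeleton feature extractor; `distill_loss` would then compare the 2D branch's features against those of a 3D teacher during training.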