Abstract: Human skeletons provide a compact representation
for action recognition. Compared to 3D skeletons, 2D skeletons
lack view-independence and depth, making them less robust for
motion analysis. However, 3D skeleton data requires specialized
hardware, limiting its practicality, especially in outdoor or
dynamic settings. In contrast, 2D skeletons can be extracted
from standard RGB videos, making them more accessible. To
combine the accessibility of 2D data with the robustness of 3D
representations, we propose 2D³-SkelAct, a 2D skeleton-based action
recognition model. It maps 2D inputs to a 3D latent space,
where pose and view features are decoupled. Additionally,
2D³-SkelAct distills motion cues from 3D models, enhancing motion
detail capture while keeping the benefits of 2D data. Specifically,
the pipeline of our 2D³-SkelAct consists of two steps: pose-view
decoupling and pose-view distilling. First, we use a spatiotemporal
transformer to decouple 2D skeleton sequences into
latent pose and view features, enhancing the model’s ability
to learn motion dynamics. Next, these decoupled features are
separately integrated into the 2D skeleton model through two
cross-attention modules, allowing it to extract discriminative
motion features while mitigating uncertainties in 3D viewpoint
and depth. Additionally, we distill motion cues from 3D models
to compensate for the limitations of 2D skeletons. Remarkably,
our model can seamlessly integrate with various skeleton feature
extractors. We validate the proposed 2D³-SkelAct through extensive
experiments, demonstrating its adaptability across different
model architectures and consistent performance improvements.
When combined with advanced skeleton feature extractors,
2D³-SkelAct achieves state-of-the-art performance in 2D skeleton-based
action recognition.
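
As a rough illustration of the integration step described above, the sketch below shows how features from a 2D skeleton model could attend separately to latent pose features and latent view features through two cross-attention calls. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cross_attention`, `fuse`), the absence of learned projections, the single attention head, and the residual sum used to combine the two contexts are all hypothetical choices made for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: list of d-dim vectors, e.g. per-frame 2D skeleton features (assumed)
    keys, values: lists of vectors, e.g. latent pose or view features (assumed)
    Returns one attended vector per query.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        attended.append(out)
    return attended


def fuse(skel_feats, pose_feats, view_feats):
    """Hypothetical residual fusion: add pose context and view context,
    each obtained by its own cross-attention call, to the skeleton features."""
    pose_ctx = cross_attention(skel_feats, pose_feats, pose_feats)
    view_ctx = cross_attention(skel_feats, view_feats, view_feats)
    return [[s + p + v for s, p, v in zip(sf, pc, vc)]
            for sf, pc, vc in zip(skel_feats, pose_ctx, view_ctx)]


# Toy example: two frames of 2-dim skeleton features.
skel = [[1.0, 0.0], [0.0, 1.0]]
pose = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for latent pose features
view = [[0.5, 0.5]]               # stand-in for latent view features
fused = fuse(skel, pose, view)
```

Keeping the pose and view branches as separate attention calls mirrors the decoupling idea: each branch contributes its own context, so one can be changed or ablated without touching the other.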