View-Invariant Skeleton Action Representation Learning via Motion Retargeting

Published: 01 Jan 2024, Last Modified: 11 Nov 2024Int. J. Comput. Vis. 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task, in order to disentangle the latent action-specific ‘Motion’ features on top of the visual representation of a 2D or 3D skeleton sequence. Such ‘Motion’ features are invariant to skeleton geometry and camera view and allow ViA to facilitate both, cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action. Code and models will be publicly available at https://walker-a11y.github.io/ViA-project.
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview