Abstract: There are two competing standards for self-supervised
learning in action recognition from 3D skeletons. Su et
al., 2020 [31] used an auto-encoder architecture and an
image reconstruction objective function to achieve state-of-the-art performance on the NTU60 C-View benchmark.
Rao et al., 2020 [23] used contrastive learning in the
latent space to achieve state-of-the-art performance on
the NTU60 C-Sub benchmark. Here, we reconcile these
disparate approaches by developing a taxonomy of self-supervised learning for action recognition. We observe that
leading approaches generally use one of two types of objective functions: those that seek to reconstruct the input
from a latent representation (“Attractive” learning) versus
those that also try to maximize the representation's distinctiveness (“Contrastive” learning). Independently, leading
approaches also differ in how they implement these objective functions: there are those that optimize representations
in the decoder output space and those which optimize representations in the network’s latent space (encoder output).
We find that combining these approaches yields larger gains in performance and robustness to input transformations than any individual method achieves alone, leading to state-of-the-art performance on three standard action recognition
datasets. We include links to our code and data.
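The two objective types named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard MSE reconstruction loss for the "Attractive" objective and an InfoNCE-style loss for the "Contrastive" objective, with hypothetical function names:

```python
import numpy as np

def attractive_loss(x, x_recon):
    # "Attractive" objective: pull the (decoded) reconstruction
    # toward the input, e.g. mean squared error.
    return np.mean((x - x_recon) ** 2)

def contrastive_loss(z_anchor, z_pos, z_negs, temperature=0.1):
    # "Contrastive" (InfoNCE-style) objective in the latent space:
    # attract the positive embedding, repel the negatives.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = [cos(z_anchor, z_pos)] + [cos(z_anchor, n) for n in z_negs]
    logits = np.array(sims) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy on the positive pair
```

The taxonomy's second axis is where such a loss is applied: `attractive_loss` operates in the decoder output space, while `contrastive_loss` operates on the encoder's latent vectors.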