Abstract: There are two competing standards for self-supervised
learning in action recognition from 3D skeletons. Su et
al., 2020 [31] used an auto-encoder architecture and an
image reconstruction objective function to achieve state-of-the-art performance on the NTU60 C-View benchmark.
Rao et al., 2020 [23] used contrastive learning in the
latent space to achieve state-of-the-art performance on
the NTU60 C-Sub benchmark. Here, we reconcile these
disparate approaches by developing a taxonomy of self-supervised learning for action recognition. We observe that
leading approaches generally use one of two types of objective functions: those that seek to reconstruct the input
from a latent representation (“Attractive” learning) versus
those that also try to maximize the representation's distinctiveness (“Contrastive” learning). Independently, leading
approaches also differ in how they implement these objective functions: there are those that optimize representations
in the decoder output space and those which optimize representations in the network’s latent space (encoder output).
We find that combining these approaches yields larger gains in performance and robustness to input transformations than any individual method achieves alone, leading to state-of-the-art performance on three standard action recognition
datasets. We include links to our code and data.
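The two objective types named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard MSE reconstruction loss for the "Attractive" objective and an InfoNCE-style loss for the "Contrastive" objective, with hypothetical function names:

```python
import numpy as np

def attractive_loss(x, x_recon):
    # "Attractive" objective: pull the (decoded) reconstruction
    # toward the input, e.g. mean squared error.
    return np.mean((x - x_recon) ** 2)

def contrastive_loss(z_anchor, z_pos, z_negs, temperature=0.1):
    # "Contrastive" (InfoNCE-style) objective in the latent space:
    # attract the positive embedding, repel the negatives.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = [cos(z_anchor, z_pos)] + [cos(z_anchor, n) for n in z_negs]
    logits = np.array(sims) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy on the positive pair
```

The taxonomy's second axis is where such a loss is applied: `attractive_loss` operates in the decoder output space, while `contrastive_loss` operates on the encoder's latent vectors.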