Implicitly using Human Skeleton in Self-supervised Learning: Influence on Spatio-temporal Puzzle Solving and on Video Action Recognition

Mathieu Riand, Laurent Dollé, Patrick Le Callet

Published: 2021, Last Modified: 10 Apr 2026, ROBOVIS 2021, CC BY-SA 4.0

Abstract: In this paper, we study the influence of overlaying skeleton data on human action videos when performing self-supervised learning and action recognition. We show that adding this information without additional constraints actually hurts the accuracy of the network; we argue that the network does not exploit the added skeleton and instead treats it as noise masking part of the natural image. We present first results on puzzle solving and video action recognition to support this hypothesis.