Self-Supervised Learning of Motion-Informed Latents

29 Sept 2021 (modified: 13 Feb 2023), ICLR 2022 Conference Withdrawn Submission
Keywords: Representation learning, self-supervised learning, video representation learning, pose estimation
Abstract: Siamese network architectures trained for self-supervised instance recognition can learn powerful visual representations that are useful in various downstream tasks. Many such approaches work by simply maximizing the similarity between representations of augmented images of the same object. In this paper, we expand on the success of these methods by studying an unusual training scheme for learning motion-informed representations. Our goal is to show that common Siamese networks can be trained effectively on video sequences to disentangle pose- and motion-related attributes that are useful for video and non-video tasks, yet are typically suppressed by standard training schemes. Unlike parallel efforts that focus on introducing new image-space data-augmentation operators, we argue that extending the augmentation strategy to use different frames of a video leads to more powerful representations. To show the effectiveness of this approach, we learn representations on the Objectron and UCF101 datasets and evaluate them on pose estimation, action recognition, and object re-identification. We show that self-supervised learning on in-domain video sequences yields better results across these tasks than fine-tuning networks pre-trained on still images. Furthermore, we carefully validate our method against a number of baselines.
One-sentence Summary: A study of self-supervised learning on video frame pairs to learn pose-, motion-, and geometry-sensitive representations.
Supplementary Material: zip
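
The core idea described in the abstract, using two different frames of the same video as the two Siamese "views" in place of two augmentations of one still image, can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: it assumes a SimSiam-style setup (predictor head plus stop-gradient on the target branch) and a ResNet-50 backbone, and all names (SiameseVideoModel, neg_cosine, training_step) are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SiameseVideoModel(nn.Module):
    """Siamese encoder whose two inputs are frames from the same video."""
    def __init__(self, feat_dim=2048, proj_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled features
        self.encoder = backbone
        # Projection head producing the latent compared across frames.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.BatchNorm1d(proj_dim),
            nn.ReLU(inplace=True), nn.Linear(proj_dim, proj_dim),
        )
        # Prediction head applied to one branch only (SimSiam-style).
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, x):
        z = self.projector(self.encoder(x))
        return z, self.predictor(z)

def neg_cosine(p, z):
    # Negative cosine similarity; stop-gradient on the target branch.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def training_step(model, frame_a, frame_b, optimizer):
    # frame_a / frame_b: two (optionally augmented) frames sampled from
    # the same video clip, standing in for the usual two augmented views.
    z_a, p_a = model(frame_a)
    z_b, p_b = model(frame_b)
    loss = 0.5 * (neg_cosine(p_a, z_b) + neg_cosine(p_b, z_a))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because the two views differ by real object and camera motion rather than only synthetic photometric or geometric augmentations, the shared encoder is pushed to retain pose- and motion-related attributes instead of suppressing them as nuisance variation.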