Vi-MIX FOR SELF-SUPERVISED VIDEO REPRESENTATION

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: data augmentation, self-supervision, video representation
Abstract: Contrastive representation learning of videos relies heavily on exhaustive data augmentation strategies. Therefore, towards designing video augmentation for self-supervised learning, we first analyze the best strategy to mix videos to create a new augmented video sample. Then the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy, Vi-Mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of the learned video representations. We conduct exhaustive experiments on two downstream tasks, action recognition and video retrieval, on three popular video datasets: UCF101, HMDB51, and NTU-60. We show that the performance of Vi-Mix on both downstream tasks is on par with other self-supervised approaches while requiring less training data.
One-sentence Summary: A novel strategy to mix videos for learning discriminative self-supervised video representation.
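
Code Sketch: To make the two mixing steps described in the abstract concrete, below is a minimal PyTorch-style sketch of (i) mixup-style blending of two video clips and (ii) a CutMix-style cut-and-paste of a spatio-temporal feature block ("tesseract") across two modality streams. The function names, the (B, C, T, H, W) tensor layout, and the alpha parameter are illustrative assumptions, not the authors' released implementation.

```python
import torch
import numpy as np


def video_mix(clip_a, clip_b, alpha=1.0):
    """Mixup-style blending of two video clips of shape (B, C, T, H, W)."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * clip_a + (1.0 - lam) * clip_b
    return mixed, lam


def cross_modal_manifold_cutmix(feat_mod1, feat_mod2, alpha=1.0):
    """CutMix-style replacement of a spatio-temporal block ("tesseract") of one
    modality's intermediate feature map with the corresponding block from the
    other modality. Both feature maps are assumed to share the shape
    (B, C, T, H, W) at the chosen layer."""
    lam = np.random.beta(alpha, alpha)
    B, C, T, H, W = feat_mod1.shape

    # Sample a random spatio-temporal box whose volume is roughly (1 - lam)
    # of the full feature tesseract.
    cut_ratio = (1.0 - lam) ** (1.0 / 3.0)
    ct, ch, cw = int(T * cut_ratio), int(H * cut_ratio), int(W * cut_ratio)
    t0 = np.random.randint(0, T - ct + 1)
    h0 = np.random.randint(0, H - ch + 1)
    w0 = np.random.randint(0, W - cw + 1)

    # Paste the block from modality 2 into modality 1.
    mixed = feat_mod1.clone()
    mixed[:, :, t0:t0 + ct, h0:h0 + ch, w0:w0 + cw] = \
        feat_mod2[:, :, t0:t0 + ct, h0:h0 + ch, w0:w0 + cw]

    # Effective mixing coefficient after rounding the box to integer sizes.
    lam_eff = 1.0 - (ct * ch * cw) / (T * H * W)
    return mixed, lam_eff
```

In this sketch, video_mix would be applied to the raw clips before the encoders, while cross_modal_manifold_cutmix would be applied to intermediate feature maps of the two modality streams (e.g. RGB and optical flow), matching the preliminary-mixing-then-CMMC ordering that defines Vi-Mix.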