Vi-MIX FOR SELF-SUPERVISED VIDEO REPRESENTATION

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: data augmentation, self-supervision, video representation
Abstract: Contrastive representation learning of videos relies heavily on exhaustive data augmentation strategies. Therefore, towards designing video augmentation for self-supervised learning, we first analyze the best strategy to mix videos to create a new augmented video sample. Then the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy, Vi-Mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of the learned video representations. We conduct exhaustive experiments on two downstream tasks, action recognition and video retrieval, on three popular video datasets: UCF101, HMDB51, and NTU-60. We show that the performance of Vi-Mix on both downstream tasks is on par with other self-supervised approaches while requiring less training data.
One-sentence Summary: A novel strategy to mix videos for learning discriminative self-supervised video representation.
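
Code Sketch: To make the two mixing steps described in the abstract concrete, below is a minimal PyTorch-style sketch of (i) mixup-style blending of two video clips and (ii) a CutMix-style cut-and-paste of a spatio-temporal feature block ("tesseract") across two modality streams. The function names, the (B, C, T, H, W) tensor layout, and the alpha parameter are illustrative assumptions, not the authors' released implementation.

```python
import torch
import numpy as np


def video_mix(clip_a, clip_b, alpha=1.0):
    """Mixup-style blending of two video clips of shape (B, C, T, H, W)."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * clip_a + (1.0 - lam) * clip_b
    return mixed, lam


def cross_modal_manifold_cutmix(feat_mod1, feat_mod2, alpha=1.0):
    """CutMix-style replacement of a spatio-temporal block ("tesseract") of one
    modality's intermediate feature map with the corresponding block from the
    other modality. Both feature maps are assumed to share the shape
    (B, C, T, H, W) at the chosen layer."""
    lam = np.random.beta(alpha, alpha)
    B, C, T, H, W = feat_mod1.shape

    # Sample a random spatio-temporal box whose volume is roughly (1 - lam)
    # of the full feature tesseract.
    cut_ratio = (1.0 - lam) ** (1.0 / 3.0)
    ct, ch, cw = int(T * cut_ratio), int(H * cut_ratio), int(W * cut_ratio)
    t0 = np.random.randint(0, T - ct + 1)
    h0 = np.random.randint(0, H - ch + 1)
    w0 = np.random.randint(0, W - cw + 1)

    # Paste the block from modality 2 into modality 1.
    mixed = feat_mod1.clone()
    mixed[:, :, t0:t0 + ct, h0:h0 + ch, w0:w0 + cw] = \
        feat_mod2[:, :, t0:t0 + ct, h0:h0 + ch, w0:w0 + cw]

    # Effective mixing coefficient after rounding the box to integer sizes.
    lam_eff = 1.0 - (ct * ch * cw) / (T * H * W)
    return mixed, lam_eff
```

In this sketch, video_mix would be applied to the raw clips before the encoders, while cross_modal_manifold_cutmix would be applied to intermediate feature maps of the two modality streams (e.g. RGB and optical flow), matching the preliminary-mixing-then-CMMC ordering that defines Vi-Mix.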