Self-supervised video pretraining yields strong image representationsDownload PDF

Published: 01 Feb 2023, Last Modified: 12 Mar 2024Submitted to ICLR 2023Readers: Everyone
Keywords: self-supervised, contrastive, video, representation learning, image segmentation, object detection
TL;DR: We achieve SoTA transfer to image scene understanding tasks using frame-based models pre-trained using contrastive learning on videos.
Abstract: Videos contain far more information than still images, and hold the potential for learning rich representations of the visual world. Yet, pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information and previous attempts at video pretraining have fallen short on image understanding tasks. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. To that end, we propose a dataset curation procedure that addresses the domain mismatch between video and image datasets, and develop a contrastive learning framework which handles the complex transformations present in natural videos. This simple paradigm for distilling knowledge from videos to image representations, called VITO, performs surprisingly well on a variety of image-based transfer learning tasks. For the first time, our video-pretrained model closes the gap with ImageNet pretraining on semantic segmentation on PASCAL and ADE20k and object detection on COCO and LVIS, raising the possibility of video-pretraining becoming the new default for learning image representations.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2210.06433/code)
19 Replies

Loading