Keywords: visual pretraining, self supervised learning, bootstrapping
TL;DR: A RL-like approach for pre-training on all types of image data (labeled and unlabeled), focused on learning ``web-scale'' datasets like web-crawls and videos.
Abstract: A common approach to learning from unlabeled images is to train models to satisfy invariances on these images, such as consistency under augmentations or crops. Despite successes on Imagenet, these approaches struggle to learn from larger uncurated datasets like web crawls or video, where such inductive biases only weakly hold. How can we more effectively learn from broader datasets? Instead of training models to be invariant across views, we study an alternative approach encouraging model representations to be \textit{predictive} of important semantics of adjacent views of an image. We concurrently train a model to predict semantic annotations from images (generated either self-supervised, or from auxiliary datasets); and bootstrap the model's semantics by predicting, given a cropped view of an image and the coordinates for a nearby crop, the model's annotation distribution for the neighboring view.  A core strength of this approach is the ability to extract information universally from both unlabelled and labelled image data, incorporating captions, bounding boxes, and other annotations when they are present. Our experiments show that annotation propagation improves pre-training on unlabelled datasets in the wild, including video datasets like EpicKitchens, scene datasets like COCO, and uncurated web-scale image datasets like CC12M.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13043
Loading