Abstract: Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset.
In this work, motivated by the excellent scalability of web data, we consider self-supervised pre-training on noisy, web-sourced image-text paired data.
First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training (a minimal sketch of such a contrastive objective follows the abstract).
We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks.
We derive an information-theoretic view that explains these benchmark results and provides insight into how to design a novel vision learner.
Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator~(MUG), that learns from scalable web-sourced image-text data.
MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties.
Pre-trained models and code will be made public upon acceptance.
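For readers unfamiliar with the multi-modal baseline mentioned above, the following is a minimal, illustrative sketch of a CLIP-style image-text contrastive (InfoNCE) objective. It is not the paper's MUG method; the function name, embedding dimensions, and temperature value are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors produced by separate encoders.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(image_text_contrastive_loss(img, txt).item())
```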
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://huggingface.co/spaces/tennant/MUG_caption
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 2039