Caption supervision enables robust learners: a controlled study of distributionally robust model training

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: self-supervised learning, computer vision, effective robustness, vision language, CLIP, ImageNet, LAION, CC12M, YFCC
TL;DR: We introduce CaptionNet, a fully captioned, fully supervised dataset with ImageNet-compliant labels, and through experiments, show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision.
Abstract: Vision language models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision; the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that CNNs trained with a standard cross-entropy loss can also benefit from caption supervision, in some cases even more than VL models, on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet, one piece of which is a class-balanced, fully supervised dataset with over 50,000 new human-labeled, ImageNet-compliant samples paired with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision.
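To make the abstract's notion of caption supervision for a cross-entropy learner concrete, the following is a minimal, hypothetical sketch (not the paper's released code or exact pipeline): web-scraped captions are matched against a synonym table to produce ImageNet-style hard labels, which are then consumed by an ordinary CNN and cross-entropy loss. The synonym table, helper names, and matching rule are all illustrative assumptions.

```python
# Illustrative sketch of caption supervision with a standard cross-entropy
# learner. NOT the paper's implementation; all names here are hypothetical.
from typing import Optional

import torch
import torch.nn as nn
from torchvision import models

# Hypothetical synonym table: class index -> strings that may appear in captions.
CLASS_SYNONYMS = {
    0: ["tench", "tinca tinca"],
    1: ["goldfish", "carassius auratus"],
    # ... one entry per target class
}

def caption_to_label(caption: str) -> Optional[int]:
    """Assign a class label if any synonym occurs in the caption, else None."""
    text = caption.lower()
    for label, synonyms in CLASS_SYNONYMS.items():
        if any(s in text for s in synonyms):
            return label
    return None  # unmatched captions are filtered out

def build_caption_supervised_batch(images, captions):
    """Keep only (image, label) pairs whose caption matched a class synonym."""
    kept_images, kept_labels = [], []
    for img, cap in zip(images, captions):
        label = caption_to_label(cap)
        if label is not None:
            kept_images.append(img)
            kept_labels.append(label)
    return torch.stack(kept_images), torch.tensor(kept_labels)

# A CNN trained with ordinary cross-entropy on the caption-derived labels,
# exactly as it would be trained on human annotations.
model = models.resnet50(num_classes=len(CLASS_SYNONYMS))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(images, captions):
    x, y = build_caption_supervised_batch(images, captions)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is only that the image-linked text, once mapped to a label, enters training exactly where a human label would; the paper's actual label-assignment and filtration strategies may differ.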
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
Supplementary Material: zip