Keywords: robustness, foundation models, CLIP, LAION, ImageNet, generalization, OOD robustness, distribution shift, vision language models, self-supervised learning, contrastive learning, ObjectNet, ImageNet-R, ImageNet-Sketch, ImageNet-A, ImageNet-V2
TL;DR: CLIP's ability to generalize to standard OOD benchmarks does not mainly stem from highly similar images in its training dataset.
Abstract: Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training datasets (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP, as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet’s train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP’s overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP’s performance.
Submission Number: 42
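The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or the pruning rule, so the sketch assumes cosine similarity over precomputed embeddings and a hypothetical rule that drops any pool sample closer to a test sample than that test sample's nearest neighbor in the reference training set. All function and variable names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_similarity(query, candidates):
    """Similarity of `query` to its most similar vector in `candidates`."""
    return max(cosine(query, c) for c in candidates)


def prune_pool(pool, test, reference):
    """Return a keep-mask (list of bools) over `pool`.

    Assumed rule (hypothetical): a pool sample is dropped if it is more
    similar to some test sample than that test sample's nearest neighbor
    in the reference training set (standing in for ImageNet train).
    """
    # One threshold per test sample: its nearest-neighbor similarity
    # to the reference training set.
    thresholds = [nearest_similarity(t, reference) for t in test]
    keep = []
    for p in pool:
        too_close = any(cosine(p, t) > thr for t, thr in zip(test, thresholds))
        keep.append(not too_close)
    return keep


# Toy 2-D "embeddings"; real embeddings would come from an image encoder.
test_set = [[1.0, 0.0]]
reference_train = [[1.0, 1.0]]
pool = [[1.0, 0.1], [0.0, 1.0], [1.0, 2.0]]

keep_mask = prune_pool(pool, test_set, reference_train)
pruned_pool = [p for p, k in zip(pool, keep_mask) if k]
```

Under this assumed rule, no sample in the pruned pool is closer to a test sample than the reference training set already is, which is one way to read "replicating" a reference train-test similarity. At web scale the pairwise loop would be replaced by an approximate nearest-neighbor index.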