Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

TMLR Paper2602 Authors

30 Apr 2024 (modified: 01 Aug 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: What distinguishes robust models from non-robust ones? While it has been shown that differences in robustness to ImageNet distribution shifts can be traced back predominantly to differences in training data, it remains unknown what this translates to in terms of what the model has actually learned. In this work, we bridge this gap by probing the representation spaces of 16 robust CLIP vision encoders with various backbones (ResNets and ViTs) and pretraining sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M, and DataComp), and comparing them to the representation spaces of less robust models with identical backbones but different (pre)training sets or objectives (CLIP pretraining on ImageNet-Captions, and supervised training or finetuning on ImageNet). This analysis yields three novel insights. First, we detect outlier features in the robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these have been observed in non-language and non-transformer models. Second, we find the presence of outlier features to be a signature of ImageNet shift robustness, since in our analysis they appear only in robust models. Lastly, we investigate the number of unique concepts encoded in the representation space and find that zero-shot CLIP models encode a higher number of unique concepts. However, we find this to be a signature of language supervision rather than of ImageNet shift robustness.
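The abstract does not spell out how outlier features are detected; in the literature they are commonly identified as embedding dimensions whose activation magnitudes are several times larger than the average dimension. The sketch below illustrates that generic criterion on a batch of vision-encoder embeddings; the `find_outlier_dims` helper and the 5x-mean threshold are illustrative assumptions, not the paper's procedure.

```python
# Minimal sketch: flag "outlier" feature dimensions in a batch of image
# embeddings, i.e. dimensions whose mean activation magnitude is far larger
# than that of a typical dimension. The 5x-mean threshold is an assumption
# for illustration only.
import torch

def find_outlier_dims(features: torch.Tensor, ratio: float = 5.0) -> torch.Tensor:
    """features: (batch, dim) embeddings from a CLIP vision encoder."""
    mags = features.abs().mean(dim=0)        # mean |activation| per dimension
    threshold = ratio * mags.mean()          # dims far above the average magnitude
    return torch.nonzero(mags > threshold).flatten()

# Usage with random stand-in features (replace with real CLIP embeddings):
feats = torch.randn(512, 768)
feats[:, 42] *= 20.0                         # inject an artificial outlier dimension
print(find_outlier_dims(feats))              # -> tensor([42])
```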
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Pavel_Izmailov1
Submission Number: 2602