Abstract: We present a comprehensive experimental study on pre-trained feature extractors for visual out-of-distribution (OOD) detection, focusing on leveraging contrastive language-image pre-trained (CLIP) models. Without fine-tuning on the training data, we are able to establish a positive correlation ($R^2\geq0.92$) between in-distribution classification and unsupervised OOD detection for CLIP models across $4$ benchmarks. We further propose a simple and scalable new method called \textit{pseudo-label probing} (PLP) that adapts vision-language models for OOD detection. Given the label names of the training set, PLP trains a linear layer using pseudo-labels derived from the text encoder of CLIP. Intriguingly, we show that, without modifying the weights of CLIP or training additional image/text encoders, (i) PLP outperforms the previous state-of-the-art on all $5$ large-scale ImageNet-based benchmarks, with an average AUROC gain of 3.4\% using the largest CLIP model (ViT-G); (ii) linear probing outperforms fine-tuning by large margins for CLIP architectures (e.g. CLIP ViT-H achieves an average gain of 7.3\% AUROC across all ImageNet-based benchmarks); and (iii) billion-parameter CLIP models still fail at detecting feature-based adversarially manipulated OOD images. The code is available at https://github.com/HHU-MMBS/plp-official-tmlr2024.
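For readers skimming the abstract, a minimal sketch of the pseudo-label probing idea is given below. It assumes a frozen CLIP model exposed through the open_clip library; the model name, prompt template, class names, and data loader are illustrative placeholders rather than the authors' exact setup (see the linked repository for the official code). The key point conveyed by the abstract is that only the linear layer receives gradients, while CLIP's weights stay fixed.

```python
# Minimal PLP-style sketch (illustrative, not the authors' exact implementation).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import open_clip  # assumption: any CLIP implementation with image/text encoders works

# Frozen CLIP backbone; ViT-B-32 is a placeholder for the larger models studied in the paper.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# In-distribution label names (placeholders) are embedded once via the text encoder.
class_names = ["golden retriever", "tabby cat", "sports car"]
with torch.no_grad():
    text_feats = F.normalize(
        model.encode_text(tokenizer([f"a photo of a {c}" for c in class_names])), dim=-1
    )

# Hypothetical in-distribution loader; random tensors stand in for preprocessed images.
train_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224)), batch_size=4)

# Linear probe trained on frozen image features with text-derived pseudo-labels.
probe = torch.nn.Linear(text_feats.shape[-1], len(class_names))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for (images,) in train_loader:
    with torch.no_grad():
        img_feats = F.normalize(model.encode_image(images), dim=-1)
        pseudo_labels = (img_feats @ text_feats.T).argmax(dim=-1)  # zero-shot assignment
    loss = F.cross_entropy(probe(img_feats), pseudo_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time, a confidence score from the probe (e.g. max softmax probability)
# can be thresholded to flag OOD inputs.
```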
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Us4bgZ1nGd&noteId=xA8q8lvbZg
Changes Since Last Submission: A LaTeX package (specifically \usepackage{times}) altered the TMLR template, and the previous submission was therefore desk rejected. We have addressed this issue and resubmitted with the correct template.
Video: https://www.youtube.com/watch?v=eVUKVfxpx8A
Code: https://github.com/HHU-MMBS/plp-official-tmlr2024
Supplementary Material: pdf
Assigned Action Editor: ~Vincent_Dumoulin1
Submission Number: 1803