UOTA: Unsupervised Open-Set Task Adaptation Using a Vision-Language Foundation Model
Keywords: Vision-Language model, CLIP, Self-training, Open-set domain adaptation, Out-of-distribution detection
TL;DR: We propose a simple yet effective method to improve CLIP's transferability in a new, practical scenario where only open-set unlabeled data are available.
Abstract: Human-labeled data is essential for training deep learning models, but annotation costs hinder its use in real-world applications. Recently, however, models such as CLIP have demonstrated remarkable zero-shot capabilities through vision-language pre-training. Although fine-tuning with human-labeled data can further improve the performance of zero-shot models, it is often impractical in low-budget real-world scenarios. In this paper, we propose an alternative algorithm, dubbed Unsupervised Open-Set Task Adaptation (UOTA), which fully leverages the large amounts of open-set unlabeled data collected in the wild to improve pre-trained zero-shot models.
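The zero-shot capability referenced above works by comparing an image embedding against text embeddings of class prompts in a shared space. A minimal NumPy sketch of this mechanism is shown below; the function name and the toy vectors are illustrative, not part of the submission, and a real pipeline would obtain the embeddings from CLIP's image and text encoders.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity
    between one image embedding and per-class text embeddings.

    image_emb: (d,) array; text_embs: (num_classes, d) array.
    Returns (predicted class index, class probabilities).
    """
    # L2-normalize so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # one similarity per class
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over classes
    return int(np.argmax(probs)), probs

# Toy example: the image embedding points toward class 1's prompt
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([0.1, 0.9])
pred, probs = zero_shot_classify(image_emb, text_embs)
```

In the open-set setting the paper targets, unlabeled images may belong to none of the prompted classes, so the softmax probabilities alone are not a reliable signal; this is the gap UOTA addresses.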
Submission Number: 36