Towards Robustness Prompt Tuning with Fully Test-Time Adaptation for CLIP's Zero-Shot Generalization

Published: 20 Jul 2024, Last Modified: 30 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: In the field of Vision-Language Models (VLMs), the Contrastive Language-Image Pretraining (CLIP) model has achieved outstanding performance on many downstream tasks through prompt tuning. By aligning image and text representations, CLIP exhibits zero-shot generalization on unseen data. However, when new categories or distribution shifts occur, the pretrained text embeddings in CLIP may no longer align well with unseen images, degrading CLIP's zero-shot generalization performance. To address this issue, many existing methods update the CLIP model with test samples during testing, a process known as Test-Time Adaptation (TTA). Previous TTA techniques that rely on image augmentation can overfit to outlying samples, while methods based on teacher-student distillation increase memory use. Moreover, both approaches significantly increase inference time, which is a crucial factor at test time. To improve robustness, mitigate overfitting, and reduce bias toward outlying samples, we propose a novel method, Self-Text Distillation with Conjugate Pseudo-labels (SCP), designed to enhance CLIP's zero-shot generalization. SCP uses gradient information from conjugate pseudo-labels to strengthen the model's robustness to distribution shifts. It also uses a fixed prompt list to distill the learnable prompts within the same model, acting as a self-regularization mechanism that minimizes overfitting. In addition, SCP is a fully test-time adaptation method that requires no retraining: it directly improves CLIP's zero-shot generalization at test time without increasing memory overhead or inference time. In evaluations across three zero-shot generalization scenarios, SCP surpasses existing state-of-the-art methods in performance while significantly reducing inference time.
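To make the two loss terms described above concrete, the sketch below shows one way the test-time objective could be assembled: a conjugate-pseudo-label loss on CLIP-style logits plus a self-distillation term that pulls predictions from learnable prompts toward predictions from a fixed prompt list of the same frozen backbone. This is a minimal illustration, not the authors' released implementation; the function name `scp_style_losses`, tensor shapes, temperature, and loss weighting are assumptions for demonstration only.

```python
# Minimal sketch (not the authors' code) of a test-time objective combining
# (i) a conjugate-pseudo-label loss on CLIP logits with (ii) self-text
# distillation from a fixed prompt list. Shapes and hyperparameters are
# illustrative assumptions.

import torch
import torch.nn.functional as F


def scp_style_losses(
    image_feats: torch.Tensor,           # (B, D) L2-normalized image embeddings
    learnable_text_feats: torch.Tensor,  # (C, D) embeddings from learnable prompts
    fixed_text_feats: torch.Tensor,      # (C, D) embeddings from a fixed prompt list
    logit_scale: float = 100.0,
    temperature: float = 2.0,
    distill_weight: float = 1.0,
) -> torch.Tensor:
    """Combined test-time loss: conjugate pseudo-labels + self-text distillation."""
    # CLIP-style logits: cosine similarity scaled by the logit scale.
    logits = logit_scale * image_feats @ learnable_text_feats.t()    # (B, C)
    logits_fixed = logit_scale * image_feats @ fixed_text_feats.t()  # (B, C)

    # For a cross-entropy-trained model, conjugate pseudo-labels are the
    # temperature-scaled softmax of the model's own logits, used as soft targets.
    with torch.no_grad():
        conj_targets = F.softmax(logits / temperature, dim=-1)
    loss_cpl = F.cross_entropy(logits, conj_targets)

    # Self-text distillation: keep learnable-prompt predictions close to the
    # fixed-prompt predictions of the same frozen backbone (KL divergence).
    with torch.no_grad():
        fixed_probs = F.softmax(logits_fixed, dim=-1)
    loss_distill = F.kl_div(
        F.log_softmax(logits, dim=-1), fixed_probs, reduction="batchmean"
    )

    return loss_cpl + distill_weight * loss_distill


if __name__ == "__main__":
    # Toy shapes only: 8 test images, 10 classes, 512-dim embedding space.
    B, C, D = 8, 10, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    learnable = F.normalize(torch.randn(C, D, requires_grad=True), dim=-1)
    fixed = F.normalize(torch.randn(C, D), dim=-1)

    loss = scp_style_losses(img, learnable, fixed)
    loss.backward()  # gradients reach only the learnable text embeddings
    print(f"combined test-time loss: {loss.item():.4f}")
```

In this sketch only the learnable text embeddings receive gradients, so a single optimizer step per test batch suffices and the image encoder stays frozen, consistent with a fully test-time adaptation setting that avoids retraining or an extra teacher model.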
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: The paper introduces SCP, a novel approach designed to enhance the zero-shot generalization of the vision-language model CLIP. The method addresses key challenges in multimodal processing in two main ways: improving CLIP's robustness to distribution shifts and minimizing overfitting to outlier samples. SCP leverages CLIP's strength in integrating image and text representations, extending its utility in multimodal processing through adaptive prompt tuning. Notably, the approach is fully test-time, directly improving CLIP's zero-shot generalization at inference. This is a significant advantage in multimodal applications where computational efficiency and on-the-fly adaptation to new data are paramount. Evaluation across three zero-shot generalization scenarios demonstrates that SCP surpasses state-of-the-art methods not only in performance but also in significantly reducing inference time, making it particularly relevant for multimodal processing tasks that demand fast, robust performance across varied and evolving content distributions.
Supplementary Material: zip
Submission Number: 3040