Unleashing the Power of Visual Prompting At the Pixel Level

22 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: prompting, CLIP, computer vision
Abstract: This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our approach is underpinned by two key designs. First, rather than directly adding the prompt to the image, we treat the prompt as an extra, independent learnable entity. We show that the strategy for reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunk image works best empirically. Second, we re-introduce two “old tricks” commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into the realm of visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method sets a new record of 82.5% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.2%. Notably, this performance not only surpasses linear probing by +2.2% but, on certain datasets, is on par with the results of full fine-tuning. Additionally, our prompting method shows competitive performance across different data scales and against distribution shifts.
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
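
The two designs named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the nearest-neighbour resize, the toy sizes, and the use of the L2 norm for gradient normalization are all assumptions made for illustration, and input diversity (random input transformations) is left out.

```python
import numpy as np

def shrink_nearest(image, size):
    """Nearest-neighbour resize of an (H, W, C) image to (size, size, C)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def apply_prompt(image, prompt, inner):
    """Place the shrunk image at the centre of the prompt canvas.

    The prompt stays a separate array: its border ring surrounds the image,
    and its centre is overwritten rather than added to the image.
    """
    canvas = prompt.copy()
    pad = (prompt.shape[0] - inner) // 2
    canvas[pad:pad + inner, pad:pad + inner] = shrink_nearest(image, inner)
    return canvas

def border_mask(full, inner):
    """1 on the visible prompt ring, 0 where the image covers the prompt."""
    mask = np.ones((full, full, 1), dtype=np.float32)
    pad = (full - inner) // 2
    mask[pad:pad + inner, pad:pad + inner] = 0.0
    return mask

def normalized_step(prompt, grad, mask, lr=0.1, eps=1e-12):
    """One descent step with the masked gradient rescaled to unit L2 norm.

    The L2 norm is an assumed choice; the abstract only says the gradient
    is normalized.
    """
    g = grad * mask
    g = g / (np.linalg.norm(g) + eps)
    return prompt - lr * g

# Toy usage with hypothetical sizes: an 8x8 canvas around a 4x4 shrunk image.
image = np.random.rand(16, 16, 3)
prompt = np.zeros((8, 8, 3))
prompted = apply_prompt(image, prompt, inner=4)
fake_grad = np.random.randn(8, 8, 3)  # stand-in for a real loss gradient
prompt = normalized_step(prompt, fake_grad, border_mask(8, 4))
```

Masking the gradient before normalizing means the step size depends only on the part of the prompt the model actually sees, which is one plausible reading of the design rather than a confirmed detail.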
