Unleashing the Power of Visual Prompting At the Pixel Level

Published: 28 May 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0.
Abstract: This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our approach is underpinned by two key designs. First, rather than directly adding the prompt and the image together, we treat the prompt as an extra, independent learnable entity. We show that the strategy for reconciling the prompt and the image matters, and find that padding the prompt around a properly shrunk image empirically works best. Second, we re-introduce two "old tricks" commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into the realm of visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method sets a new record of 82.5% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.2%. Notably, this performance not only surpasses linear probing by +2.2% but is also, on certain datasets, on par with fully fine-tuning. Additionally, our prompting method shows competitive performance across different data scales and under distribution shifts.
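The authors' released code is linked below; purely as an illustration of the abstract's first design (shrink the image and learn an independent prompt padded around its border), here is a minimal, hypothetical PyTorch sketch. The class name PadPrompter, the 30-pixel pad width, and the 224-pixel canvas are our own assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PadPrompter(nn.Module):
    """Illustrative sketch: a learnable prompt padded around a shrunk image.

    The input image is resized to (image_size - 2 * pad) on each side and
    placed at the center of the canvas; the learnable prompt occupies only
    the surrounding border, so it never overwrites image pixels.
    (Names and sizes here are assumptions, not the authors' code.)
    """

    def __init__(self, image_size=224, pad=30):
        super().__init__()
        self.image_size = image_size
        self.pad = pad
        # Prompt parameterized on the full canvas; only its border is used.
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x):
        inner = self.image_size - 2 * self.pad
        # Shrink the image so the prompt can wrap around it.
        x_small = F.interpolate(
            x, size=(inner, inner), mode="bilinear", align_corners=False
        )
        # Zero-pad the shrunk image back to the canvas size ...
        x_padded = F.pad(x_small, [self.pad] * 4)
        # ... and add the prompt only on the border region.
        mask = torch.ones_like(self.prompt)
        mask[:, :, self.pad:-self.pad, self.pad:-self.pad] = 0
        return x_padded + self.prompt * mask


if __name__ == "__main__":
    prompter = PadPrompter(image_size=224, pad=30)
    images = torch.randn(4, 3, 256, 256)  # raw images of any resolution
    prompted = prompter(images)
    print(prompted.shape)  # torch.Size([4, 3, 224, 224])
```

In such a setup, only `prompter.prompt` would be optimized while the pre-trained backbone stays frozen; the abstract's second design (input diversity and gradient normalization) would then be applied during this optimization, e.g., by randomly resizing inputs and normalizing the prompt's gradient before each update.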
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the AE and reviewers for their valuable suggestions, which have greatly improved the quality of this paper. In response, we have made the following revisions: First, we have added an analysis of the literature on prompt reprogramming, ILM, and AutoVP. In addition, we have incorporated the experiments and discussions raised during the rebuttal stage into both the main text and the supplementary materials, making the paper more comprehensive. Once again, we sincerely thank you all for the suggestions that have made this paper more complete.
Code: https://github.com/UCSC-VLAA/EVP
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 2367