Unleashing the Power of Visual Prompting At the Pixel Level

TMLR Paper 2367 Authors

11 Mar 2024 (modified: 28 Apr 2024) · Decision pending for TMLR · CC BY-SA 4.0
Abstract: This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our approach is underpinned by two key designs. First, rather than directly adding the prompt and the image together, we treat the prompt as an extra, independent learnable entity. We show that the strategy for reconciling the prompt and the image matters, and find that wrapping the prompt around a properly shrunken image empirically works the best. Second, we re-introduce two "old tricks" commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into the realm of visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method registers a new record of 82.5% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.2%. It is worth noting that such performance not only surpasses linear probing by +2.2% but, on certain datasets, is on par with the results of full fine-tuning. Additionally, our prompting method shows competitive performance across different data scales and under distribution shifts.
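To make the two designs in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. The names (PadPrompter, diversify, prompt_step), the prompt width, the scale range, and the learning rate are all illustrative assumptions, and the CLIP encoder and loss computation are omitted.

```python
import torch
import torch.nn.functional as F


class PadPrompter(torch.nn.Module):
    """Treat the prompt as an independent learnable canvas wrapped around a
    shrunken input image, rather than adding prompt + image directly."""

    def __init__(self, image_size=224, prompt_size=30):
        super().__init__()
        self.image_size = image_size
        self.prompt_size = prompt_size
        self.inner_size = image_size - 2 * prompt_size
        # Full-canvas learnable prompt; only its border stays visible.
        self.prompt = torch.nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x):
        # Shrink the image so the prompt can occupy the surrounding border.
        x_small = F.interpolate(x, size=self.inner_size, mode="bilinear",
                                align_corners=False)
        canvas = self.prompt.expand(x.size(0), -1, -1, -1).clone()
        p = self.prompt_size
        canvas[:, :, p:p + self.inner_size, p:p + self.inner_size] = x_small
        return canvas


def diversify(x, low=0.9, high=1.0):
    """Input diversity 'old trick' (assumed form): randomly rescale the
    prompted image and resize it back, discouraging overfitting to one layout."""
    orig = x.shape[-1]
    scale = torch.empty(1).uniform_(low, high).item()
    size = max(1, int(orig * scale))
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    return F.interpolate(x, size=orig, mode="bilinear", align_corners=False)


def prompt_step(prompter, loss, lr=0.1):
    """Gradient normalization 'old trick' (assumed form): update the prompt
    along the L2-normalized gradient so step sizes stay comparable."""
    prompter.zero_grad()
    loss.backward()
    with torch.no_grad():
        g = prompter.prompt.grad
        prompter.prompt -= lr * g / (g.norm() + 1e-8)
```

A usage round, under these assumptions, would pass diversify(PadPrompter()(images)) through the frozen CLIP image encoder, compute a classification loss against the text features, and call prompt_step so that only the pixel-level prompt is updated.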
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 2367