Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Decoupling the target task into parts, each with its own specialized visual prompt, unlocks visual prompting's potential.
Abstract: Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. *Visual reprogramming* (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our *decoupled visual prompts* (DVP) are optimized using descriptions grouped by explicit **c**au**se**s (DVP-cse) or unsupervised **cl**u**s**ters (DVP-cls). Then, we integrate the outputs of these visual prompts with a *probabilistic reweighting matrix* (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming.
Lay Summary: This paper presents Decoupled Visual Prompts (DVP) to improve how vision-language models like CLIP adapt to new tasks without retraining. Current methods train a single "visual prompt" (i.e., a small-scale, learnable pattern added to images) to align images with text descriptions, but this can miss important visual details or focus on less useful features. DVP solves this by splitting new tasks into smaller parts: it trains multiple prompts, each specialized for different aspects (like shape or color) or groups of similar descriptions. These prompts are then combined using an adaptive reweighting method that learns which features matter most for each task. Experiments show DVP outperforms existing methods across 11 datasets, and it also provides insights into how the reprogrammed vision-language model makes decisions.
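Illustrative sketch (not the authors' released code): the snippet below shows one plausible way to combine multiple decoupled visual prompts with a probabilistic reweighting matrix (PRM) for a CLIP-style model. The module name, the `image_encoder` / `group_text_features` interfaces, and the softmax-over-groups parameterization of the PRM are assumptions made for illustration; see the linked repository for the actual implementation.

```python
# Minimal sketch of decoupled visual prompts (DVP) + probabilistic reweighting
# matrix (PRM) for CLIP-style reprogramming. Interfaces and shapes are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledVisualPrompts(nn.Module):
    def __init__(self, num_groups, num_classes, image_size=224):
        super().__init__()
        # One trainable additive noise pattern (visual prompt) per description group.
        self.prompts = nn.Parameter(torch.zeros(num_groups, 3, image_size, image_size))
        # PRM logits: contribution of each (group, class) pair; softmax over groups
        # gives a probabilistic reweighting of the per-prompt predictions.
        self.prm_logits = nn.Parameter(torch.zeros(num_groups, num_classes))

    def forward(self, images, image_encoder, group_text_features):
        # images: (B, 3, H, W)
        # group_text_features: (num_groups, num_classes, embed_dim), precomputed
        # from the grouped class descriptions with the frozen CLIP text encoder.
        logits_per_group = []
        for g in range(self.prompts.shape[0]):
            prompted = images + self.prompts[g]                        # add g-th prompt
            img_feat = F.normalize(image_encoder(prompted), dim=-1)    # (B, D)
            txt_feat = F.normalize(group_text_features[g], dim=-1)     # (C, D)
            logits_per_group.append(img_feat @ txt_feat.t())           # (B, C)
        logits = torch.stack(logits_per_group, dim=0)                  # (G, B, C)
        prm = F.softmax(self.prm_logits, dim=0).unsqueeze(1)           # (G, 1, C)
        return (logits * prm).sum(dim=0)                               # (B, C)
```

In practice, `image_encoder` would be the frozen CLIP image encoder (e.g., `model.encode_image` from a loaded CLIP backbone), and only the prompt patterns and PRM logits would receive gradients during training on the downstream task.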
Link To Code: https://github.com/tmlr-group/DecoupledVP
Primary Area: General Machine Learning
Keywords: Model Reprogramming, Visual Reprogramming, Vision-Language Model, Visual Prompting
Submission Number: 11510