Abstract: Visual prompting is an efficient methodology for fine-tuning pretrained visual models by introducing a small number of learnable parameters while keeping the backbone frozen. However, most existing visual prompting methods learn a shared prompt for all samples, making it difficult to capture the distinct characteristics of diverse samples and thereby limiting the model's performance. While other methods partially address this issue through sample clustering and learning multiple prompts, they still struggle to capture nuanced differences among instances and incur significant parameter overhead. Therefore, to comprehensively and efficiently leverage the discriminative characteristics of individual instances, we propose an Instance Visual Prompting method, called InsVP. First, the instance image prompt is introduced to extract both crucial and nuanced discriminative information from the original image itself and is overlaid onto the input image. Furthermore, the instance feature prompt is designed to capture both commonalities and characteristics among individual instances and is fed into the model's intermediate layers to facilitate feature extraction. Consequently, the instance image and feature prompts complement each other, enhancing the ability of pretrained models to extract discriminative features from individual instances. Extensive experiments on various large-scale benchmarks show that our InsVP achieves superior performance, surpassing state-of-the-art methods at a lower parameter cost. Our code will be released.
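A minimal sketch of how such instance-conditioned prompting could be wired around a frozen ViT-style backbone is given below. The module names (ImagePrompter, FeaturePrompter, forward_with_prompts), the prompt shapes, and the timm-style patch_embed/blocks interface are illustrative assumptions for exposition, not the released InsVP implementation.

import torch
import torch.nn as nn

class ImagePrompter(nn.Module):
    # Generates a pixel-level prompt from the input image itself and overlays it.
    def __init__(self, channels=3, hidden=8):
        super().__init__()
        # Lightweight conv bottleneck; the exact design here is an assumption made for brevity.
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)  # instance image prompt overlaid onto the input image

class FeaturePrompter(nn.Module):
    # Produces per-instance prompt tokens that are prepended before a transformer block.
    def __init__(self, dim, num_prompts=4):
        super().__init__()
        self.shared = nn.Parameter(torch.zeros(num_prompts, dim))  # commonalities across instances
        self.proj = nn.Linear(dim, num_prompts * dim)              # instance-specific characteristics
        self.num_prompts = num_prompts

    def forward(self, tokens):
        b, _, d = tokens.shape
        instance = self.proj(tokens.mean(dim=1)).view(b, self.num_prompts, d)
        prompts = self.shared.unsqueeze(0) + instance
        return torch.cat([prompts, tokens], dim=1)  # prepend prompts to the token sequence

def forward_with_prompts(backbone, image_prompter, feature_prompters, x):
    # Assumed timm-style ViT interface (patch_embed + blocks); the backbone stays frozen.
    x = image_prompter(x)                          # prompt the raw image
    tokens = backbone.patch_embed(x)
    for block, prompter in zip(backbone.blocks, feature_prompters):
        tokens = prompter(tokens)                  # inject instance feature prompts
        tokens = block(tokens)
        tokens = tokens[:, prompter.num_prompts:]  # drop prompts before the next layer
    return tokens.mean(dim=1)                      # pooled feature for a linear classification head

Under these assumptions, only the prompters and the task head would receive gradients during tuning, keeping the trainable parameter count small relative to the frozen backbone.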
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In recent years, with the advancement of deep learning, the size of multi-modal models has increased significantly. At the same time, the downstream tasks and application scenarios for multi-modal models are expanding. Given the complexity of these tasks and the diversity of edge devices, quickly and efficiently fine-tuning pretrained multi-modal models under limited storage and computational resources has become a critical challenge. The prompt tuning technique we investigate adapts pretrained models to downstream tasks with minimal computational and storage costs while keeping the pretrained backbone unchanged: a small number of learnable prompts is introduced to adjust the pretrained model to new tasks. Building on existing methods, our objective is to further reduce the computational demands of prompt tuning while improving adaptability to downstream tasks, thus enabling the deployment of multi-modal models across a broad array of edge devices.
Supplementary Material: zip
Submission Number: 337