Keywords: visual prompt, parameter efficient finetuning, learning theory, generalization analysis
TL;DR: We propose activation prompting to analyze, understand, and advance conventional visual prompting for parameter-efficient transfer learning.
Abstract: Visual prompting (VP) has emerged as a popular method to repurpose large pretrained models for downstream vision tasks. Unlike many parameter-efficient fine-tuning (PEFT) techniques that modify model parameters, VP introduces a universal perturbation directly into the input data to facilitate task-specific fine-tuning while keeping the pretrained model intact. However, a noticeable performance gap remains between VP and conventional fine-tuning methods, and how to understand and close this gap is largely unexplored in both theory and practice. To this end, we introduce a novel concept, termed activation prompt (AP), which extends input-level VP by allowing universal perturbations to be applied to activation maps in the intermediate layers of the model. With the aid of AP, we unveil the intrinsic limitations of VP in both performance and efficiency. We also show that AP shares a close connection to normalization tuning used in convolutional neural networks (CNNs) and vision transformers (ViTs), albeit with variations in layer preferences for prompting. We theoretically elucidate the rationale behind these preferences by analyzing global features across layers. Through extensive experiments across 29 datasets and various model architectures, we provide a thorough performance analysis of AP, comparing it with VP and PEFT baselines. Our experimental results demonstrate that AP significantly surpasses input-level VP in both accuracy and efficiency, considering time, parameters, memory usage, and throughput.
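To make the distinction between VP and AP concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a tiny frozen two-layer network written with plain Python lists; the names FROZEN_LAYERS, forward, and prompt_layer are hypothetical and chosen only for illustration. The same additive prompt is shared by every input (it is "universal"), and only where it is added changes: at the input (VP) or at an intermediate activation (AP). Training the prompt is omitted.

```python
# Toy illustration only: a frozen two-layer ReLU network on plain Python lists.
# All names here (FROZEN_LAYERS, forward, prompt_layer) are hypothetical.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Frozen "pretrained" weights: these are never updated.
FROZEN_LAYERS = [
    [[1.0, -1.0], [0.5, 0.5]],
    [[1.0, 2.0], [-1.0, 1.0]],
]

def forward(x, prompt, prompt_layer):
    """Run the frozen network, adding `prompt` to the input of one layer.

    prompt_layer == 0 mimics an input-level visual prompt (VP);
    prompt_layer >= 1 mimics an activation prompt (AP) added to an
    intermediate activation.
    """
    h = x
    for i, W in enumerate(FROZEN_LAYERS):
        if i == prompt_layer:
            h = add(h, prompt)  # the only task-specific quantity
        h = relu(matvec(W, h))
    return h

x = [1.0, 2.0]
prompt = [1.0, 0.0]  # in practice this would be learned; fixed here

vp_out = forward(x, prompt, prompt_layer=0)  # perturb the input
ap_out = forward(x, prompt, prompt_layer=1)  # perturb a hidden activation
print(vp_out, ap_out)
```

In this sketch the network weights stay fixed in both cases, so the prompt is the only quantity a downstream task would tune; the choice of prompt_layer is the design decision that the layer-preference analysis in the abstract refers to.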
Submission Number: 56