Visual Prompting Reimagined: The Power of Activation Prompts

18 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: parameter-efficient fine-tuning, transfer learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose activation prompting to analyze, understand, and advance conventional visual prompting for parameter-efficient transfer learning.
Abstract: Visual prompting (VP) has emerged as a popular method to repurpose large pretrained models for downstream vision tasks. Unlike many parameter-efficient finetuning (PEFT) techniques that modify model parameters, VP introduces a universal perturbation directly into the input data to facilitate task-specific finetuning while keeping the pretrained model intact. However, there exists a noticeable performance gap between VP and conventional finetuning methods, leaving an unexplored realm, in both theory and practice, of understanding and advancing VP to close this gap. Towards this end, we introduce a novel concept, termed activation prompt (AP), which extends the scope of input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. With the aid of AP, we show that VP, by its input perturbation design, has intrinsic limitations in both performance and efficiency. By contrast, AP shares a natural connection to normalization tuning, e.g., batch normalization for convolutional neural networks (CNNs) and layer normalization for vision transformers (ViTs). This illuminates why normalization tuning has been observed to outperform VP in the literature. Furthermore, we show that the choice of prompting exhibits a distinct preference for layer depth, with conclusions varying significantly between CNNs and ViTs. We theoretically elucidate the rationale behind such preference by analyzing global features across layers. By conducting extensive experiments across 29 datasets and various model architectures, we provide a thorough performance analysis of AP, comparing it with VP and PEFT baselines. Our experimental results demonstrate that AP significantly surpasses input-level VP in terms of both accuracy and efficiency, considering factors like time, parameters, memory usage, and throughput. These results further support our new insights into the limitations of VP and the capabilities of AP.
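
The core idea can be pictured with a short, hedged sketch. The snippet below is not the authors' released code; the module names (`InputPrompt`, `ActivationPrompt`, `attach_activation_prompt`), the torchvision ResNet-18 backbone, and the choice of hooking `layer3` are illustrative assumptions. It contrasts input-level VP (a learnable perturbation added to every image) with AP (a learnable per-channel perturbation added to an intermediate activation map via a forward hook), while the pretrained backbone stays frozen.

```python
# Minimal sketch of VP vs. AP under the assumptions stated above.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class InputPrompt(nn.Module):
    """VP-style prompt: one learnable universal perturbation added to every input image."""

    def __init__(self, image_shape=(3, 224, 224)):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, *image_shape))

    def forward(self, x):
        return x + self.delta


class ActivationPrompt(nn.Module):
    """AP-style prompt: a learnable perturbation added to an intermediate activation map."""

    def __init__(self, num_channels):
        super().__init__()
        # Per-channel offset, broadcast over batch and spatial dimensions.
        self.delta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, activation):
        return activation + self.delta


def attach_activation_prompt(layer, prompt):
    """Register a forward hook that adds the prompt to the chosen layer's output."""
    def hook(module, inputs, output):
        return prompt(output)
    return layer.register_forward_hook(hook)


if __name__ == "__main__":
    model = resnet18(weights=None)  # pretrained weights would normally be loaded here
    for p in model.parameters():
        p.requires_grad_(False)     # keep the backbone frozen, as in PEFT

    # Prompt the output of the third residual stage (256 channels in ResNet-18).
    ap = ActivationPrompt(num_channels=256)
    handle = attach_activation_prompt(model.layer3, ap)

    # Only the prompt parameters are trained.
    optimizer = torch.optim.SGD(ap.parameters(), lr=0.1)

    x = torch.randn(4, 3, 224, 224)
    logits = model(x)
    loss = logits.square().mean()   # placeholder loss, for illustration only
    loss.backward()
    optimizer.step()
    handle.remove()
```

In this reading, the spatially broadcast, channel-wise form of the AP perturbation is what makes its connection to normalization tuning natural: tuning the shift parameter of a batch- or layer-normalization layer is likewise a learnable additive offset applied to an intermediate activation.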
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1399