Abstract: As a leading parameter-efficient tuning paradigm in NLP, prompt tuning has recently been explored for its potential in computer vision. Rather than updating the weights of pre-trained large-scale models (e.g., the vision transformer, or ViT for short), visual prompt tuning (VPT) introduces additional learnable parameters (i.e., prompts) that are updated during tuning. However, the original visual prompts are randomly initialized and do not leverage prior knowledge, which is frequently exploited in NLP (e.g., instructions). To bridge this gap, we propose a novel methodology that injects semantic priors into prompt tuning. To this end, we pioneer the use of both fundamental image priors and advanced image semantics as such priors. The former (color, texture, and shape) are extracted by classical hand-crafted operators and suit the input space, while the self-attention map serves as the latter and suits the feature space. We further propose a cascading scheme that integrates the two types of semantic priors into ViT tuning. Extensive experiments on 34 challenging image classification datasets demonstrate the superiority of our method in adapting pre-trained ViTs to various downstream scenarios while tuning only 0.74% of the ViT's parameters.
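To make the "fundamental image prior" idea concrete, here is a minimal sketch (not the authors' code) of how classical operators could turn color, texture, and shape cues into prompt tokens for a ViT. The specific operator choices (Sobel gradients for shape, local contrast for texture, per-pixel channel spread for color) and the learnable projection layers are illustrative assumptions, since the abstract does not name the exact operators or the prompt-construction scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticPriorPrompts(nn.Module):
    """Hypothetical module: classical image priors -> one prompt token each."""

    def __init__(self, embed_dim=768, pool=8):
        super().__init__()
        self.pool = pool
        # Fixed Sobel kernels as a stand-in "shape" operator (assumption).
        k = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", k.view(1, 1, 3, 3))
        self.register_buffer("ky", k.t().contiguous().view(1, 1, 3, 3))
        # One learnable projection per prior: pooled map -> prompt embedding.
        self.proj = nn.ModuleList(
            [nn.Linear(pool * pool, embed_dim) for _ in range(3)]
        )

    def forward(self, x):                    # x: (B, 3, H, W), RGB in [0, 1]
        gray = x.mean(dim=1, keepdim=True)   # (B, 1, H, W)
        # Shape prior: Sobel edge magnitude.
        gx = F.conv2d(gray, self.kx, padding=1)
        gy = F.conv2d(gray, self.ky, padding=1)
        shape_map = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)
        # Texture prior: deviation from a local 3x3 average (local contrast).
        texture_map = (gray - F.avg_pool2d(gray, 3, 1, 1)).abs()
        # Color prior: per-pixel spread across the RGB channels.
        color_map = x.std(dim=1, keepdim=True)
        prompts = []
        for proj, m in zip(self.proj, (color_map, texture_map, shape_map)):
            v = F.adaptive_avg_pool2d(m, self.pool).flatten(1)  # (B, pool^2)
            prompts.append(proj(v))          # each: (B, embed_dim)
        return torch.stack(prompts, dim=1)   # (B, 3, embed_dim)

# Usage: the resulting tokens could be prepended to the ViT's patch tokens,
# replacing VPT's randomly initialized prompts.
prompts = SemanticPriorPrompts()(torch.rand(2, 3, 224, 224))
print(prompts.shape)  # torch.Size([2, 3, 768])
```

The advanced semantic prior (the self-attention map) would analogously be drawn from the ViT's own attention in the feature space; cascading the two, as the abstract describes, is omitted here for brevity.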
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Matias_Alejandro_Valdenegro-Toro1
Submission Number: 7421