Enhancing Pre-trained ViTs for Downstream Task Adaptation: A Locality-Aware Prompt Learning Method

Published: 20 Jul 2024 · Last Modified: 06 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract:

Vision Transformers (ViTs) excel at extracting global information from image patches. However, they are inherently limited in extracting information within local regions, which hinders their applicability and performance. In particular, fully supervised pre-trained ViTs, such as Vanilla ViT and CLIP, face the challenge of locality vanishing when adapting to downstream tasks. To address this, we introduce a novel LOcality-aware pRompt lEarning (LORE) method, aiming to improve the adaptation of pre-trained ViTs to downstream tasks. LORE integrates a data-driven Black Box module (i.e., a pre-trained ViT encoder) with a knowledge-driven White Box module. The White Box module is a locality-aware prompt learning mechanism designed to compensate for ViTs’ deficiency in incorporating local information. More specifically, it begins with a Locality Interaction Network (LIN), which treats an image as a neighbor graph and employs graph convolution operations to enhance local relationships among image patches. Subsequently, a Knowledge-Locality Attention (KLA) mechanism is proposed to capture critical local regions of images, learning Knowledge-Locality (K-L) prototypes from relevant semantic knowledge. The K-L prototypes then guide the training of a Prompt Generator (PG) that produces locality-aware prompts for images. These prompts, which aggregate crucial local information, serve as additional input to our Black Box module. By combining pre-trained ViTs with our locality-aware prompt learning mechanism, the resulting Black-White Box model captures both global and local information, facilitating effective downstream task adaptation. Experimental evaluations across four downstream tasks demonstrate the effectiveness and superiority of LORE.
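As a rough illustration of the White Box idea described in the abstract, the PyTorch sketch below builds a k-nearest-neighbor graph over patch tokens, applies one graph-convolution step to strengthen local relationships (in the spirit of LIN), and pools the result into a small set of locality-aware prompt tokens to be prepended to the frozen ViT input. The module name, hyperparameters (k, num_prompts), and the specific graph construction are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalityAwarePromptSketch(nn.Module):
    """Illustrative sketch only: the graph construction, projection layers,
    and pooling are assumptions, not the paper's specification."""

    def __init__(self, dim=768, k=8, num_prompts=4):
        super().__init__()
        self.k = k
        self.num_prompts = num_prompts
        self.gc_weight = nn.Linear(dim, dim)      # graph-convolution projection
        self.prompt_proj = nn.Linear(dim, dim)    # maps pooled local features to prompt tokens

    def forward(self, patch_tokens):              # (B, N, D) patch embeddings
        # Build a k-nearest-neighbor graph over patches by feature similarity.
        normed = F.normalize(patch_tokens, dim=-1)
        sim = normed @ normed.transpose(1, 2)                      # (B, N, N)
        knn = sim.topk(self.k, dim=-1).indices                     # (B, N, k)
        adj = torch.zeros_like(sim).scatter_(-1, knn, 1.0)
        adj = adj / adj.sum(-1, keepdim=True)                      # row-normalized adjacency

        # One graph-convolution step: aggregate neighbors, then project (LIN-style).
        local_feats = F.relu(self.gc_weight(adj @ patch_tokens)) + patch_tokens

        # Pool the locally enhanced features into a small set of prompt tokens.
        pooled = local_feats.mean(dim=1, keepdim=True)              # (B, 1, D)
        prompts = self.prompt_proj(pooled).repeat(1, self.num_prompts, 1)
        return prompts   # concatenated with patch tokens as extra input to the frozen ViT
```

In this reading, the frozen pre-trained encoder (the Black Box) is left untouched; only the prompt module is trained, so local information enters the model purely through the generated prompt tokens.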

Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: First, we propose a novel LOcality-aware pRompt lEarning method (LORE) consisting of a data-driven Black Box module and a semantic knowledge-driven White Box module for downstream task adaptation. Second, to mitigate the problem of locality vanishing in pre-trained ViT models, we design a locality-aware prompt learning mechanism as our White Box module to compensate for pre-trained ViTs' limited capacity to incorporate local information. Third, we develop a Knowledge-Locality Attention (KLA) mechanism to capture critical local regions of images; KLA learns K-L prototypes via a semantic knowledge-locality matching strategy, and these prototypes are then used to guide the training of our Prompt Generator (PG), as sketched below. Finally, experimental results on four kinds of downstream tasks across 16 benchmark datasets demonstrate the superiority of the proposed LORE method.
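To make the knowledge-locality matching idea more concrete, the sketch below shows one plausible reading of KLA: semantic knowledge vectors (e.g., class-name text embeddings) attend over patch features to highlight critical local regions and yield K-L prototypes, which could then supervise the prompt generator through an alignment loss. The function names, the attention formulation, and the cosine alignment loss are assumptions for illustration only, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def knowledge_locality_prototypes(patch_feats, knowledge_embeds, tau=0.07):
    """Hypothetical knowledge-locality matching step.

    patch_feats:      (B, N, D) locally enhanced patch features
    knowledge_embeds: (C, D) semantic knowledge vectors (e.g., class-name text embeddings)
    Returns (B, C, D) prototypes: per-knowledge-vector weighted sums of patches.
    """
    q = F.normalize(knowledge_embeds, dim=-1)                   # (C, D)
    k = F.normalize(patch_feats, dim=-1)                        # (B, N, D)
    attn = torch.einsum('cd,bnd->bcn', q, k) / tau              # knowledge-to-patch attention logits
    attn = attn.softmax(dim=-1)                                 # emphasize critical local regions
    prototypes = torch.einsum('bcn,bnd->bcd', attn, patch_feats)
    return prototypes

def prototype_alignment_loss(prompts, prototypes):
    """Assumed cosine alignment between generated prompts and K-L prototypes."""
    p = F.normalize(prompts.mean(dim=1), dim=-1)                # (B, D)
    t = F.normalize(prototypes.mean(dim=1), dim=-1)             # (B, D)
    return (1.0 - (p * t).sum(dim=-1)).mean()
```

Under this assumption, the prompt generator is optimized so that its prompts stay close to the knowledge-guided local prototypes, which is one way the prototypes could "guide" prompt training as stated above.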
Supplementary Material: zip
Submission Number: 2070