CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Published: 28 Jan 2022 · Last Modified: 22 Oct 2023 · ICLR 2022 Submission
Keywords: Pre-trained Vision-Language Models, Prompt Tuning, Visual Grounding
Abstract: Pre-trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note a significant gap between the objective forms of model pre-training and fine-tuning, which creates a need for large amounts of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address this challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs that reformulates visual grounding as a fill-in-the-blank problem with color-based co-referential markers in both image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities in VL-PTMs. Comprehensive experimental results show that prompt-tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., a 17.3% absolute accuracy improvement and a 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). All data and code will be made available to facilitate future research.
One-sentence Summary: This paper proposes a cross-modal prompt tuning framework for pre-trained vision-language models.
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2109.11797/code)
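
The color-based reformulation described in the abstract can be made concrete with a short sketch. The following is a minimal illustration, not the authors' released implementation: the color names and RGB values, the box coordinates, and the prompt template are placeholder assumptions, and the call to the VL-PTM's masked-language-model head is omitted.

```python
from PIL import Image, ImageDraw

# Placeholder color set: the paper searches for effective colors;
# these names/RGB values are illustrative assumptions only.
COLORS = [("red", (240, 0, 30)), ("green", (0, 240, 30)), ("blue", (0, 30, 240))]

def add_color_markers(image, region_boxes, alpha=120):
    """Visual sub-prompt: overlay a distinct translucent color block on
    each candidate region so regions become referable by color."""
    marked = image.convert("RGBA")
    draw = ImageDraw.Draw(marked, "RGBA")
    color_of_region = {}
    for (color_name, rgb), box in zip(COLORS, region_boxes):
        draw.rectangle(box, fill=rgb + (alpha,))
        color_of_region[color_name] = box
    return marked, color_of_region

def build_text_prompt(referring_expression):
    """Textual sub-prompt: recast the referring expression as a
    fill-in-the-blank query whose answer is a color word."""
    return f"{referring_expression} is in [MASK] color."

if __name__ == "__main__":
    image = Image.new("RGB", (640, 480), "white")        # stand-in for a real photo
    boxes = [(50, 60, 200, 220), (300, 100, 450, 300)]   # e.g., detector proposals
    marked, color_of_region = add_color_markers(image, boxes)
    prompt = build_text_prompt("the horse watched by the woman")
    # A VL-PTM's masked-language-model head would score each color word
    # at [MASK]; the top-scoring color selects the grounded region.
    # The model call itself is omitted in this sketch.
    print(prompt, "->", list(color_of_region))
```

Decoding a color word at the [MASK] position, rather than regressing box coordinates, keeps the downstream task in the same masked-language-modeling form as pre-training, which is precisely the objective gap the abstract identifies.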