Variational prompt tuning improves generalization of vision-language foundation models

Mohammad Mahdi Derakhshani; Enrique Sanchez; Adrian Bulat; Victor Guilherme Turrisi da Costa; Cees G. M. Snoek; Georgios Tzimiropoulos; Brais Martinez

Published: 04 Mar 2023, Last Modified: 16 May 2023, ME-FoMo 2023 Poster, Readers: Everyone

Keywords: Prompt Tuning, Foundation Models, Vision and Language Models, Variational Inference

TL;DR: We propose probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling.

Abstract: Using prompt tuning, large vision-language foundation models can be adapted to downstream tasks by treating part of the input language prompts as learnable parameters and freezing the rest. However, existing work on prompt tuning may damage the generalization capabilities of foundation models. To avoid this limitation, we propose probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method achieves the current state of the art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. The implementation code will be released.
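
The idea of deriving prompts through stochastic sampling inside a variational framework can be illustrated with a minimal sketch. This is not the authors' implementation: the diagonal Gaussian over prompt vectors, the standard normal prior, the reparameterized sampling, and the toy stand-in for the text encoder are all assumptions made for illustration, and the names `sample_prompt`, `kl_to_standard_normal`, and `predict` are hypothetical.

```python
import math
import random


def sample_prompt(mu, log_var, rng):
    """Draw one prompt vector as mu + sigma * eps (reparameterization trick).

    Assumption: the prompt distribution is a diagonal Gaussian.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)).

    Assumption: the prior over prompts is a standard normal.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(image_feat, class_feats, mu, log_var, n_samples, rng):
    """Average class probabilities over several sampled prompts."""
    avg = [0.0] * len(class_feats)
    for _ in range(n_samples):
        prompt = sample_prompt(mu, log_var, rng)
        logits = []
        for cls in class_feats:
            # Toy stand-in for a text encoder (hypothetical): the sampled
            # prompt rescales each dimension of the class embedding.
            text_feat = [c * (1.0 + p) for c, p in zip(cls, prompt)]
            logits.append(sum(i * t for i, t in zip(image_feat, text_feat)))
        for k, prob in enumerate(softmax(logits)):
            avg[k] += prob / n_samples
    return avg


rng = random.Random(0)
mu, log_var = [0.0, 0.0, 0.0], [-2.0, -2.0, -2.0]
image_feat = [1.0, 0.0, 0.5]
class_feats = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.3]]
probs = predict(image_feat, class_feats, mu, log_var, 8, rng)
print(probs, kl_to_standard_normal(mu, log_var))
```

In a training loop under these assumptions, `mu` and `log_var` would be the learnable parameters, and the KL term would be added to the classification loss as a regularizer that keeps the prompt distribution close to the prior.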
