Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers
Abstract: Large-scale transformer models are challenging to adapt to new tasks: because of their extensive parameter count, fine-tuning demands substantial compute, time, and data. To address this, zero-shot and few-shot learning approaches, aided by techniques such as prompts and parameter-efficient modules, have emerged. However, these techniques are often tailored to vision-only or language-only tasks, leaving open the question of how effective they are on multi-modal tasks such as image captioning. This paper explores the effectiveness of prompts and parameter-efficient modules in reducing the training effort for image captioning. Rather than fine-tuning the entire model, we trained only the prompt and parameter-efficient modules on top of the pretrained Oscar transformer using the COCO dataset. We evaluated five prompt tuning approaches and two parameter-efficient methods. Notably, combining visual prompt tuning (VPT) with Adapter and LoRA led to a 2% CIDEr score improvement after just one epoch of training.
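The core recipe described above, freezing a pretrained single-stream backbone and training only prompt tokens (as in VPT) plus low-rank or adapter modules (as in LoRA), can be sketched in plain PyTorch. The snippet below is a minimal, illustrative sketch and does not use the Oscar codebase or the paper's implementation; the class names (`LoRALinear`, `VisualPromptWrapper`) and all hyperparameters (rank, number of prompt tokens, hidden size) are assumptions chosen for demonstration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # pretrained projection plus scaled low-rank correction B @ A
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


class VisualPromptWrapper(nn.Module):
    """Prepends learnable prompt tokens to the input sequence (VPT-style)."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_prompts: int = 10):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # freeze the pretrained encoder
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)

    def forward(self, tokens):  # tokens: (batch, seq_len, hidden_dim)
        batch = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))


# Example: wrap a toy transformer encoder and count trainable parameters.
hidden = 768
layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
model = VisualPromptWrapper(encoder, hidden_dim=hidden, num_prompts=10)

# A LoRA-augmented projection that could replace a frozen linear layer.
lora_proj = LoRALinear(nn.Linear(hidden, hidden), rank=8)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")
```

In this setup only the prompt embeddings and the low-rank matrices receive gradients, which is what keeps the per-task training cost a small fraction of full fine-tuning.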