Meta-prompt tuning for low-resource visual question answering

Published: 01 Jan 2025 · Last Modified: 25 Jul 2025 · Multimedia Systems, 2025 · License: CC BY-SA 4.0
Abstract: Fine-tuning pre-trained Vision-Language Models (VLMs) has recently achieved notable success on Low-resource Visual Question Answering (LVQA) tasks. However, existing approaches often fail to align questions with their corresponding image features, and this lack of fine-grained analysis reduces question-answering accuracy. To mitigate these issues, we propose a Meta-Prompt Tuning (MPT) approach that enables the model to understand and analyze diverse questions by attending to the relevant image regions, thereby producing accurate answers from a limited amount of data. Specifically, to strengthen the model's handling of image and question information, we devise a dual-loop training framework: in the inner loop, type-specific instructions guide the model in processing different kinds of questions, while in the outer loop, the model accumulates general knowledge across diverse question-answer pairs. Furthermore, to analyze the current question in detail and focus on the relevant visual features, we design Meta-Prompt Generation modules and Dynamic Routers that parse the input content and dynamically combine meta-prompts as required. Experimental results on standard LVQA datasets demonstrate that the proposed method outperforms competing approaches, with significant accuracy gains across question-answer types.
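The dual-loop scheme described above resembles MAML-style episodic meta-learning applied to prompt parameters, with a dynamic router mixing a pool of meta-prompts per input. The sketch below illustrates that idea only; all module names, shapes, hyperparameters, and the fusion scheme are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
# Hypothetical sketch of dual-loop meta-prompt tuning (not the authors' code).
# Assumes pre-extracted image/question features and a frozen VLM backbone
# represented here by a trivial answer head.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call


class MetaPromptGenerator(nn.Module):
    """Pool of learnable meta-prompts plus a dynamic router that mixes them
    according to the fused question/image context (assumed form)."""

    def __init__(self, num_prompts=8, prompt_len=4, dim=256):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(dim, num_prompts)

    def forward(self, context):                       # context: (B, dim)
        weights = F.softmax(self.router(context), dim=-1)          # (B, N)
        # Per-example weighted combination of meta-prompts.
        return torch.einsum("bn,nld->bld", weights, self.prompts)  # (B, L, dim)


class TinyVQAModel(nn.Module):
    """Stand-in for the VLM: only the prompt generator and answer head are tuned."""

    def __init__(self, dim=256, num_answers=100):
        super().__init__()
        self.prompt_gen = MetaPromptGenerator(dim=dim)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, img_feat, q_feat):              # both (B, dim)
        context = img_feat + q_feat
        prompts = self.prompt_gen(context)            # (B, L, dim)
        fused = prompts.mean(dim=1) + context         # crude fusion, for illustration
        return self.head(fused)


def dual_loop_step(model, support, query, inner_lr=1e-2):
    """One episode: adapt the prompt parameters on a support set drawn from one
    question type (inner loop), then return the query loss for the outer update."""
    prompt_params = dict(model.prompt_gen.named_parameters())

    # Inner loop: one gradient step on the support set.
    s_img, s_q, s_y = support
    s_loss = F.cross_entropy(model(s_img, s_q), s_y)
    grads = torch.autograd.grad(s_loss, list(prompt_params.values()), create_graph=True)
    adapted = {name: p - inner_lr * g
               for (name, p), g in zip(prompt_params.items(), grads)}

    # Outer loop: evaluate the query set with the adapted prompt parameters,
    # substituted via functional_call so the module itself is not mutated.
    q_img, q_q, q_y = query
    logits = functional_call(model,
                             {f"prompt_gen.{k}": v for k, v in adapted.items()},
                             (q_img, q_q))
    return F.cross_entropy(logits, q_y)


if __name__ == "__main__":
    # Toy usage: random episodes standing in for per-question-type tasks.
    model = TinyVQAModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    make_batch = lambda: (torch.randn(4, 256), torch.randn(4, 256),
                          torch.randint(0, 100, (4,)))
    loss = dual_loop_step(model, support=make_batch(), query=make_batch())
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this reading, the inner loop specializes the meta-prompts to one question type while the outer loop updates the shared parameters so that such adaptation works across many types; how MPT actually injects instructions and routes prompts is detailed in the paper itself.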