Abstract: Pretraining and finetuning large vision-language models (VLMs) have achieved remarkable success in visual question answering (VQA). However, finetuning VLMs requires heavy computation, incurs expensive storage costs, and is prone to overfitting for VQA in low-resource settings. Existing prompt tuning methods have reduced the number of tunable parameters, but they cannot capture valid context-aware information during prompt encoding, resulting in 1) poor generalization to unseen answers and 2) smaller improvements as more parameters are added. To address these issues, we propose a prompt tuning method for low-resource VQA named Adaptive Self-Prompt Tuning (Self-PT), which utilizes representations of question-image pairs as conditions to obtain context-aware prompts. To enhance generalization to unseen answers, Self-PT uses dynamic instance-level prompts to avoid overfitting to the correlations between static prompts and the answers seen during training. To reduce parameters, we utilize hyper-networks and low-rank parameter factorization to make Self-PT more flexible and efficient. The hyper-network decouples the number of parameters from the prompt length, generating prompts of flexible length with a fixed number of parameters, while the low-rank parameter factorization decomposes and reparameterizes the weights of the prompt encoder into a low-rank subspace for better parameter efficiency. Experiments conducted on VQA v2, GQA, and OK-VQA under different low-resource settings show that our Self-PT outperforms state-of-the-art parameter-efficient methods, especially in lower-shot settings, e.g., a 6% average improvement across the three datasets in the 16-shot setting. Code is available at https://github.com/NJUPT-MCC/Self-PT.
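To make the mechanism concrete, below is a minimal sketch of a context-conditioned, low-rank prompt encoder in PyTorch. It is an illustration under stated assumptions, not the authors' released implementation: module names, dimensions, and the exact conditioning scheme (adding the pooled question-image representation to learned position embeddings before a shared low-rank projection) are hypothetical choices consistent with the abstract's description.

```python
import torch
import torch.nn as nn


class SelfPTSketch(nn.Module):
    """Hypothetical sketch of a context-aware prompt encoder.

    Prompts are generated per instance from a pooled question-image
    representation; the encoder weights are shared across prompt
    positions and factorized into a low-rank form.
    """

    def __init__(self, d_model=768, rank=8, prompt_len=16):
        super().__init__()
        self.prompt_len = prompt_len
        # Learned position embeddings: prompt length can change without
        # changing the size of the shared low-rank encoder below.
        self.pos_emb = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        # Low-rank factorization: a full d_model x d_model projection is
        # replaced by two thin matrices (d_model x rank, rank x d_model).
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, context):
        # context: (batch, d_model) pooled question-image representation.
        # Broadcast the instance-level context over all prompt positions,
        # then encode through the shared low-rank projection.
        cond = context.unsqueeze(1) + self.pos_emb.unsqueeze(0)  # (B, L, d)
        return self.up(self.down(cond))                          # (B, L, d)


# Usage: prepend the generated prompts to the VLM input embeddings.
encoder = SelfPTSketch()
fused = torch.randn(4, 768)        # e.g., pooled question-image features
prompt_tokens = encoder(fused)     # (4, 16, 768) dynamic, instance-level prompts
```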