PLMAS: Adaptive Sample Selection for Prompting LLMs in Knowledge-Based Visual Question Answering

Jian Li, Quanxing Xu, Ling Zhou, Feifei Zhang, Rubing Huang

Published: 31 Mar 2026, Last Modified: 15 May 2026, ACM Transactions on Multimedia Computing, Communications, and Applications, CC BY-SA 4.0

Abstract: With the rapid advancement of large-scale models, Visual Question Answering (VQA), a core subfield of multimodal research, increasingly relies on these models to address complex challenges. This trend is especially evident in Knowledge-based VQA (KB-VQA), which requires integrating external knowledge. While most studies approach KB-VQA using explicit or implicit knowledge bases, recent work (e.g., PICa and Prophet) employs in-context learning to prompt large language models (LLMs) as a source of implicit knowledge. However, existing sample selection strategies for in-context learning are oversimplified and fail to adequately leverage the tacit knowledge encoded within LLMs. To address this limitation, we propose an adaptive sample selection strategy that computes three similarities (question-image, question-caption, and question-pre-answer) and dynamically assembles the most relevant samples using weighted combinations, thereby effectively activating the LLM’s implicit knowledge. We evaluate the approach on the OK-VQA and A-OKVQA benchmark datasets. Results demonstrate that our method (PLMAS) achieves state-of-the-art performance on both.
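The weighted combination of the three similarities can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how features are extracted, how each similarity is computed, or how the weights are adapted. The sketch assumes each sample already carries one feature vector per channel, uses cosine similarity, and takes fixed weights as input; the names `CHANNELS`, `combined_score`, and `select_examples` are hypothetical.

```python
import math

# Hypothetical channel names mirroring the three similarities in the abstract.
CHANNELS = ("question_image", "question_caption", "question_pre_answer")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(query, candidate, weights):
    """Weighted sum of per-channel similarities between a query and a candidate."""
    return sum(weights[c] * cosine(query[c], candidate[c]) for c in CHANNELS)


def select_examples(query, pool, weights, k):
    """Return the k candidates with the highest combined score, best first."""
    ranked = sorted(pool, key=lambda s: combined_score(query, s, weights), reverse=True)
    return ranked[:k]


# Toy 2-D features, purely illustrative (assumption: real features would come
# from image, caption, and answer encoders that are not described here).
query = {
    "question_image": [1.0, 0.0],
    "question_caption": [0.0, 1.0],
    "question_pre_answer": [1.0, 1.0],
}
pool = [
    {"id": "a", "question_image": [1.0, 0.0], "question_caption": [0.0, 1.0],
     "question_pre_answer": [1.0, 1.0]},
    {"id": "b", "question_image": [0.0, 1.0], "question_caption": [1.0, 0.0],
     "question_pre_answer": [1.0, 1.0]},
    {"id": "c", "question_image": [1.0, 0.0], "question_caption": [1.0, 0.0],
     "question_pre_answer": [1.0, -1.0]},
]
# Assumed fixed weights; an adaptive scheme would set these per query.
weights = {"question_image": 0.4, "question_caption": 0.3, "question_pre_answer": 0.3}

top = select_examples(query, pool, weights, k=2)
print([s["id"] for s in top])
```

In this toy setup the selected samples would then be formatted as in-context examples in the LLM prompt. Replacing the fixed `weights` with per-query values is where an adaptive strategy would plug in.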

External IDs: doi:10.1145/3777476