Abstract: With the rapid advancement of large-scale models, Visual Question Answering (VQA), a core subfield of multimodal research, increasingly relies on these models to address complex challenges. This trend is especially evident in Knowledge-based VQA (KB-VQA), which requires integrating external knowledge. Most studies approach KB-VQA with explicit or implicit knowledge bases, whereas recent work such as PICa and Prophet uses in-context learning to guide large language models (LLMs) with implicit knowledge. However, existing strategies for selecting in-context samples are oversimplified and fail to fully leverage the implicit knowledge encoded in LLMs. To address this limitation, we propose an adaptive sample selection strategy that computes three similarities (question-image, question-caption, and question-pre-answer) and dynamically assembles the most relevant samples through a weighted combination of them, thereby more effectively activating the LLM's implicit knowledge. Experiments on the OK-VQA and A-OKVQA benchmarks show that our method (PLMAS) achieves state-of-the-art performance on both datasets.
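The selection step described in the abstract can be illustrated with a minimal sketch. It is not the PLMAS implementation. It assumes that each sample carries one embedding per view (question-image, question-caption, question-pre-answer), that each similarity is a cosine similarity between the test sample's embedding and a candidate's embedding for the same view, and that the weights are given as fixed numbers. The abstract does not say how the embeddings are built or how the weights are adapted, so the names `VIEWS`, `cosine`, and `select_examples` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical view names; the abstract only names the three similarity types.
VIEWS = ("question_image", "question_caption", "question_pre_answer")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_examples(
    query: Dict[str, Sequence[float]],
    candidates: List[Dict[str, Sequence[float]]],
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
    k: int = 2,
) -> List[int]:
    """Return indices of the k candidates with the highest weighted similarity.

    Assumption: the score is a weighted sum of per-view cosine similarities.
    The weights are fixed inputs here; the paper's adaptive weighting is not
    specified in the abstract and is not reproduced.
    """
    scores = []
    for cand in candidates:
        score = sum(
            w * cosine(query[view], cand[view]) for w, view in zip(weights, VIEWS)
        )
        scores.append(score)
    # Sort candidate indices by descending score; ties keep their original order.
    ranked = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return ranked[:k]
```

Under these assumptions, the returned indices would identify the training samples to place in the LLM prompt as in-context examples, and changing `weights` shifts which of the three views dominates the ranking.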
External IDs: doi:10.1145/3777476