Enhancing GPT-3.5 for Knowledge-Based VQA with In-Context Prompt Learning and Image Captioning

Published: 01 Jan 2024, Last Modified: 11 Apr 2025, SMC 2024, CC BY-SA 4.0
Abstract: Traditional visual question answering (VQA) often falls short because the image alone does not contain the information needed to answer the question; Knowledge-Based Visual Question Answering (KB-VQA) addresses this by incorporating external knowledge. Typically, a KB-VQA system first retrieves knowledge from external knowledge bases and then reasons jointly over the retrieved knowledge and the visual content to predict an answer. However, current models often exhibit weak visual perception when processing image information; because external knowledge bases are incomplete, the retrieved knowledge may be noisy or even irrelevant; and re-embedding knowledge text features during reasoning can drift from the original meaning in the knowledge base. To address these challenges, we propose a KB-VQA method built on GPT-3.5 that leverages image captions and in-context prompts. We use an advanced captioning model to convert images into accurate textual descriptions, improving the large language model's understanding of visual information. We also remove the need for an additional knowledge base by employing GPT-3.5 itself as the knowledge source and generating logically consistent text during inference to predict answers. Finally, we strengthen GPT-3.5's question-answering ability for VQA through in-context prompt learning. Experiments on the public OK-VQA dataset demonstrate the superior performance of our model.
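The abstract describes a caption-then-prompt pipeline: convert the image to a caption, assemble a few in-context exemplars, and query GPT-3.5 for the answer. The sketch below illustrates one plausible realization under stated assumptions; the helper names, exemplar format, prompt wording, and model identifier are illustrative choices, not the authors' exact setup.

```python
# Minimal sketch of the described pipeline: caption an image, build an
# in-context prompt from a few exemplar (caption, question, answer) triples,
# and query GPT-3.5 for the answer. All helper names, the exemplar format,
# and the prompt wording are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_prompt(caption, question, exemplars):
    """Concatenate in-context exemplars with the test caption and question."""
    blocks = [
        f"Context: {ex['caption']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in exemplars
    ]
    blocks.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def answer_question(caption, question, exemplars):
    """Ask GPT-3.5 to answer the question given the image caption and exemplars."""
    prompt = build_prompt(caption, question, exemplars)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the visual question in a few words."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


# Hypothetical usage: the caption would come from an off-the-shelf captioning model.
exemplars = [
    {"caption": "A red double-decker bus on a city street.",
     "question": "In which country would you most likely see this bus?",
     "answer": "england"},
]
caption = "A man holding a surfboard walks toward the ocean."
print(answer_question(caption, "What sport is the man about to do?", exemplars))
```

In this sketch, GPT-3.5 serves both as the implicit knowledge source and as the answer generator, so no separate retrieval step over an external knowledge base is needed, which mirrors the design choice stated in the abstract.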