Recent advancements in multimodal large language models (MLLMs) have achieved strong performance in vision-language tasks such as visual question answering (VQA). However, these models struggle with knowledge-intensive VQA (KI-VQA) tasks that require fine-grained domain knowledge, as seen in benchmarks such as Encyclopedic VQA and InfoSeek. To address these challenges, we propose a novel retrieval-augmented generation (RAG) framework, referred to as KIRA, designed to enhance the capability of MLLMs for KI-VQA without task-specific fine-tuning. Our goal is to integrate general image-text similarity with detailed knowledge context to achieve precise entity recognition. To this end, we leverage CLIP to perform general image-text matching, and design a verification mechanism based on detailed question-text relevance to improve recognition accuracy. We evaluate our method on KI-VQA benchmarks, demonstrating significant improvements of 47.5% on Encyclopedic VQA and 16.2% on InfoSeek, all achieved without additional training. These results highlight the potential of our training-free, plug-and-play framework for solving knowledge-intensive visual question answering tasks.
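To make the two-stage idea concrete, the sketch below shows one way a CLIP-based retrieval step could be combined with a question-text verification step over a small entity knowledge base. The model names, the knowledge-base schema, the fusion weight `alpha`, and the helper `retrieve_entity` are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: CLIP image-text matching followed by question-text verification.
# All model choices and the kb schema are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer, util

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
verifier = SentenceTransformer("all-MiniLM-L6-v2")  # assumed text encoder for verification


def retrieve_entity(image: Image.Image, question: str, kb: list[dict], alpha: float = 0.5) -> dict:
    """Stage 1: general image-text matching with CLIP over candidate entity names.
    Stage 2: verification by question-text relevance over entity descriptions."""
    names = [entry["name"] for entry in kb]

    # Stage 1: CLIP similarity between the query image and each candidate entity name.
    inputs = clip_proc(text=names, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        clip_scores = clip(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

    # Stage 2: relevance between the question and each entity's knowledge text.
    q_emb = verifier.encode(question, convert_to_tensor=True)
    doc_embs = verifier.encode([entry["description"] for entry in kb], convert_to_tensor=True)
    verify_scores = util.cos_sim(q_emb, doc_embs).squeeze(0)

    # Simple score fusion (illustrative only); the top entity's context would then
    # be passed to the MLLM as retrieved knowledge.
    fused = alpha * clip_scores + (1 - alpha) * verify_scores
    return kb[int(fused.argmax())]


# Usage (assumed kb schema): entries look like {"name": "...", "description": "..."}.
# best = retrieve_entity(Image.open("query.jpg"), "When was this building completed?", kb)
```

In this reading, the CLIP stage narrows the candidate entities by visual similarity, while the verification stage re-scores them against the question text, so both signals contribute to the final entity choice before generation.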