Seek Inner: LLM-Enhanced Information Mining for Medical Visual Question Answering

Published: 2025 · Last Modified: 14 Jan 2026 · WWW (Companion Volume) 2025 · License: CC BY-SA 4.0
Abstract: Medical visual question answering (Med-VQA) focuses on analyzing medical images to accurately respond to clinicians' specific questions. Although integrating prior knowledge can enhance VQA reasoning, current methods often struggle to extract relevant information from the vast and complex medical knowledge base, thereby limiting the models' ability to learn domain-specific features. To overcome this limitation, our study presents a novel information mining approach that leverages large language models (LLMs) to efficiently retrieve pertinent data. Specifically, we design a latent knowledge generation module that employs LLMs to separately extract and filter information from questions and answers, enhancing the model's inference capabilities. Furthermore, we propose a multi-level prompt fusion module in which an initial prompt interacts with the extracted latent knowledge to draw clinically relevant details from both unimodal and multimodal features. Experimental results demonstrate that our approach outperforms current state-of-the-art models on multiple Med-VQA benchmark datasets.
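Since the abstract only outlines the two modules, the sketch below illustrates how a multi-level prompt fusion step of this kind might look: a minimal PyTorch sketch assuming cross-attention as the interaction mechanism and assuming the LLM-extracted latent knowledge has already been encoded into embeddings. The class name `MultiLevelPromptFusion`, the three-level ordering, and all tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiLevelPromptFusion(nn.Module):
    """Hypothetical sketch of a multi-level prompt fusion module.

    Learnable prompt tokens first interact with embeddings of the
    LLM-extracted latent knowledge, then successively attend to
    unimodal (image / question) and multimodal (fused) features.
    """

    def __init__(self, dim: int = 512, n_prompts: int = 8, n_heads: int = 8):
        super().__init__()
        # Initial prompt: a small set of learnable tokens.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # One cross-attention block per fusion level (assumed design).
        self.attn_knowledge = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_unimodal = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_multimodal = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, knowledge, image_feats, question_feats, fused_feats):
        # knowledge:      (B, K, dim) text-encoded latent knowledge from the LLM
        # image_feats:    (B, I, dim) visual tokens from the image encoder
        # question_feats: (B, Q, dim) tokens from the question encoder
        # fused_feats:    (B, M, dim) joint image-question features
        b = knowledge.size(0)
        p = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Level 1: the initial prompt absorbs the latent knowledge.
        p = self.norm(p + self.attn_knowledge(p, knowledge, knowledge)[0])
        # Level 2: the knowledge-conditioned prompt mines unimodal features.
        uni = torch.cat([image_feats, question_feats], dim=1)
        p = self.norm(p + self.attn_unimodal(p, uni, uni)[0])
        # Level 3: the prompt is refined against the multimodal features.
        p = self.norm(p + self.attn_multimodal(p, fused_feats, fused_feats)[0])
        return p  # knowledge-enriched prompt tokens for the answer head

# Toy usage with random features (batch of 2).
if __name__ == "__main__":
    fusion = MultiLevelPromptFusion()
    out = fusion(
        knowledge=torch.randn(2, 5, 512),
        image_feats=torch.randn(2, 49, 512),
        question_feats=torch.randn(2, 12, 512),
        fused_feats=torch.randn(2, 61, 512),
    )
    print(out.shape)  # torch.Size([2, 8, 512])
```

Cross-attention with residual connections is used here simply because it is the standard way to let a small set of query tokens pull information from larger feature sequences; the paper's module may interact with the features differently.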