Abstract: Medical Visual Question Answering (MedVQA), a significant branch of Visual Question Answering, aims to identify question-relevant regions in medical images and provide answers grounded in medical knowledge. However, the relatively small size and limited answer variety of existing mainstream MedVQA datasets hinder models from effectively learning medical knowledge. In this paper, we introduce a novel paradigm that addresses MedVQA by fine-tuning Large Language Models (LLMs), and based on this approach we propose a model named MedFLM to alleviate the issue of data scarcity. Specifically, MedFLM feeds the question and the image features into the LLM to generate the answer. Because training all parameters would be computationally expensive, we employ a parameter-efficient fine-tuning strategy that updates only a subset of parameters. To enhance image feature extraction, we propose an architecture that combines the strengths of Convolutional Neural Networks and Transformers. Experiments on three mainstream MedVQA datasets (VQA-RAD, SLAKE, and PathVQA) show that our model not only requires fewer computational resources but also achieves performance comparable to state-of-the-art models.
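The abstract does not specify which parameter-efficient fine-tuning strategy MedFLM uses, but the motivation (avoid updating all LLM weights) can be illustrated with a minimal sketch of a low-rank adapter in the style of LoRA. The function below is hypothetical and only compares the number of trainable parameters under full fine-tuning versus a rank-r update `B @ A` for a single weight matrix:

```python
# Hypothetical sketch: parameter counts for full fine-tuning vs. a
# LoRA-style low-rank adapter on one (d_in x d_out) weight matrix.
# The paper's actual fine-tuning method is not detailed in the abstract.

def count_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    full = d_in * d_out             # parameters updated by full fine-tuning
    lora = rank * (d_in + d_out)    # parameters in low-rank factors B and A
    return full, lora

# Example: one 4096x4096 projection with a rank-8 adapter.
full, lora = count_params(4096, 4096, 8)
print(full, lora, lora / full)      # the adapter trains ~0.4% of the weights
```

With the frozen base weights untouched, only the small factors are optimized, which is what makes fine-tuning a large model tractable on limited hardware.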
External IDs: dblp:conf/icic/PengCSZ25