Abstract: Existing medical Visual Question Answering (VQA) models focus on information interactions between input images and questions, and use answer candidates solely as classification labels while ignoring the semantic information contained in the answers. In addition, commonly used pretrained CNN visual encoders struggle to extract representative features from low-resource medical images. To improve semantic learning and feature representation capability, this paper proposes an Inference Enhancement model for medical VQA, denoted as IE-VQA. IE-VQA leverages a Convolutional Block Attention (CBA) mechanism to capture informative image features and generate representative multimodal representations. Furthermore, an Answer Refinement (AR) module is proposed to learn an informative answer embedding from the answer candidates with a Dual Semantic Fusion (DSF) strategy. This embedding is fused with the multimodal representations to enrich them, thereby enhancing the contextual semantic learning capability of IE-VQA on image-question pairs. Experiments are conducted on three publicly available medical VQA datasets: VQA-RAD, VQA-SLAKE, and PathVQA. The results show that the proposed IE-VQA achieves competitive performance compared with state-of-the-art baseline models, indicating its ability to improve model inference on low-resource medical VQA data.
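The abstract does not provide implementation details of the Convolutional Block Attention (CBA) mechanism. As a rough illustration only, the sketch below shows a CBAM-style block (channel attention followed by spatial attention) of the kind such a mechanism could build on; all class names, layer sizes, and the reduction ratio are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of a CBAM-style convolutional block attention module,
# assuming a standard channel-then-spatial attention design. Illustrative only.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class ConvBlockAttention(nn.Module):
    """Refine a CNN feature map with channel attention, then spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_att(self.channel_att(x))


# Usage: refine visual features before fusing them with the question encoding.
feats = torch.randn(2, 256, 14, 14)          # hypothetical CNN feature map
refined = ConvBlockAttention(256)(feats)     # same shape, attention-weighted
```

The refined feature map would then be combined with the question representation (and, per the paper, enriched with the answer embedding produced by the AR module) before classification; that fusion step is not specified in the abstract.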