Abstract: Medical Visual Question Answering (Med-VQA) is a domain-specific task that answers a given clinical question about a radiology image. It requires substantial prior medical knowledge, posing additional challenges compared to general VQA tasks. However, the lack of well-annotated large-scale datasets makes it hard to learn sufficient medical knowledge for Med-VQA. To address this challenge, this paper employs a large-scale medical multi-modal dataset to pre-train and fine-tune an effective model, denoted ROCOGLoRIA. The model can locate semantically rich regions implied by medical texts and extract local, semantically focused visual features from the image. We propose to combine the global visual features with the weighted local visual features to capture fine-grained semantics in the image. We further incorporate ROCOGLoRIA as the visual encoder into baseline models to investigate whether it benefits Med-VQA. We conduct extensive experiments on three benchmark datasets, and the results show that the method using ROCOGLoRIA as a pre-trained visual encoder outperforms strong baselines in overall accuracy.
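The abstract describes combining a global image embedding with text-weighted local (region-level) visual features. The following is a minimal sketch of that idea under stated assumptions: GLoRIA-style word-to-region attention produces the weights, the two feature types are fused by concatenation, and all tensor shapes, function names, and the pooling choice are illustrative rather than the authors' actual implementation.

```python
# Hedged sketch (not the authors' code): fuse a global visual feature with
# attention-weighted local features, where the weights come from word-to-region
# similarity in the style of GLoRIA. Shapes and fusion strategy are assumptions.
import torch
import torch.nn.functional as F

def fuse_global_local(v_global, v_local, t_tokens, temperature=0.1):
    """
    v_global: (B, D)      global image embedding
    v_local:  (B, R, D)   local (region/patch) image embeddings
    t_tokens: (B, T, D)   word-level text embeddings
    Returns a fused visual representation of shape (B, 2*D).
    """
    # Cosine similarity between every word and every region
    sim = torch.einsum('btd,brd->btr',
                       F.normalize(t_tokens, dim=-1),
                       F.normalize(v_local, dim=-1))          # (B, T, R)
    # Text-conditioned attention over regions
    attn = F.softmax(sim / temperature, dim=-1)               # (B, T, R)
    # Word-specific region summaries, pooled over words
    v_attended = torch.einsum('btr,brd->btd', attn, v_local).mean(dim=1)  # (B, D)
    # Combine global and semantically weighted local features
    return torch.cat([v_global, v_attended], dim=-1)          # (B, 2*D)

# Toy usage with random tensors
B, R, T, D = 2, 49, 12, 256
fused = fuse_global_local(torch.randn(B, D), torch.randn(B, R, D), torch.randn(B, T, D))
print(fused.shape)  # torch.Size([2, 512])
```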