Keywords: Multimodal, Image Captioning, Dual Image Encoders
Abstract: With the rise of multimodal learning and large language models, deep learning-based methods for medical image captioning show great promise in providing diagnostic insights. However, existing general-purpose pretrained text and image models struggle to accurately describe the complex details of medical images. Our project focuses on an image captioning approach that leverages the Segment Anything Model (SAM) to enhance feature encoding by capturing both general and detailed features. Moreover, we propose a pretraining strategy based on mixed semantic learning that enables the model to effectively capture both high-level context and fine-grained details. Experimental results demonstrate that our method surpasses the pretrained BLIP2 model across multiple metrics when generating medical image descriptions.
Submission Number: 6
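The abstract describes a dual image encoder design in which a general-purpose encoder and a SAM encoder jointly provide visual features. The following is a minimal sketch, not the authors' implementation, of how such a fusion might look in PyTorch: it assumes both encoders emit patch-level token sequences that are projected into a shared space and concatenated before being handed to the captioning language model. All module names, dimensions, and the stand-in encoders are illustrative assumptions.

```python
# Hypothetical sketch: fuse features from a general image encoder and a
# SAM-style encoder into one sequence of visual tokens for captioning.
# Encoder stand-ins and dimensions are illustrative, not the paper's code.
import torch
import torch.nn as nn


class DualImageEncoder(nn.Module):
    def __init__(self, general_encoder: nn.Module, sam_encoder: nn.Module,
                 general_dim: int = 1408, sam_dim: int = 256, out_dim: int = 768):
        super().__init__()
        self.general_encoder = general_encoder   # e.g., a frozen general-purpose ViT
        self.sam_encoder = sam_encoder           # e.g., a frozen SAM image encoder
        # Project both feature streams into a shared embedding space.
        self.general_proj = nn.Linear(general_dim, out_dim)
        self.sam_proj = nn.Linear(sam_dim, out_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # (B, N1, general_dim): high-level, global-context tokens
        global_feats = self.general_encoder(image)
        # (B, N2, sam_dim): fine-grained, segmentation-aware tokens
        detail_feats = self.sam_encoder(image)
        fused = torch.cat(
            [self.general_proj(global_feats), self.sam_proj(detail_feats)], dim=1
        )
        return fused  # (B, N1 + N2, out_dim) visual tokens for the caption generator


if __name__ == "__main__":
    # Toy stand-in encoders that emit token sequences of plausible shapes.
    class TokenStub(nn.Module):
        def __init__(self, n_tokens: int, dim: int):
            super().__init__()
            self.n_tokens, self.dim = n_tokens, dim

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.randn(x.shape[0], self.n_tokens, self.dim)

    model = DualImageEncoder(TokenStub(257, 1408), TokenStub(64, 256))
    tokens = model(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 321, 768])
```

In a full pipeline, the fused token sequence would replace the single-encoder visual input of a captioning model such as BLIP2; how the paper actually wires the two streams together (concatenation, cross-attention, or otherwise) is not specified in the abstract, so this fusion choice is an assumption.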