Abstract: Medical image captioning plays an important role in modern healthcare, improving clinical report generation and aiding radiologists in detecting abnormalities and reducing misdiagnosis. The complexity of the visual data and the biases present in the textual data make this task challenging. Recent advances in transformer-based models have significantly improved the generation of radiology reports from medical images. However, these models require substantial computational resources to train and have been observed to produce unnatural language when trained solely on raw image-text pairs. Our aim is to generate more detailed, image-specific reports and to explain the reasoning behind the generated text through image-text alignment. Given the high computational demands of end-to-end training, we introduce a two-step training methodology built around the Intelligent Visual Encoder for Bridging Modalities in Report Generation (InVERGe) model. InVERGe incorporates a lightweight transformer, the Cross-Modal Query Fusion Layer (CMQFL), which uses the output of a frozen encoder to identify the most relevant text-grounded image embeddings. This layer bridges the gap between the encoder and decoder, significantly reducing the decoder's workload and enhancing vision-language alignment. Experiments on the MIMIC-CXR, Indiana University chest X-ray, and CDD-CESM breast imaging datasets demonstrate the effectiveness of our approach.
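To make the bridging idea concrete, the following is a minimal PyTorch sketch of a query-fusion layer in the spirit of the CMQFL: a small set of learnable queries cross-attends to frozen image-encoder features and is projected into the decoder's embedding space. All dimensions, layer counts, and names here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalQueryFusionLayer(nn.Module):
    """Lightweight transformer bridging a frozen image encoder and a
    language decoder. Learnable queries distill the frozen image
    features into a compact, text-groundable visual prefix.
    Hyperparameters are illustrative, not the paper's."""

    def __init__(self, num_queries=32, query_dim=768,
                 image_dim=1024, decoder_dim=768, num_heads=8):
        super().__init__()
        # Learnable query tokens that will attend over the image features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, query_dim) * 0.02)
        # Self-attention among queries, then cross-attention to image features.
        self.self_attn = nn.MultiheadAttention(query_dim, num_heads, batch_first=True)
        self.img_proj = nn.Linear(image_dim, query_dim)
        self.cross_attn = nn.MultiheadAttention(query_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(query_dim, 4 * query_dim), nn.GELU(),
            nn.Linear(4 * query_dim, query_dim),
        )
        self.norm1 = nn.LayerNorm(query_dim)
        self.norm2 = nn.LayerNorm(query_dim)
        self.norm3 = nn.LayerNorm(query_dim)
        # Project fused queries into the decoder's embedding space.
        self.to_decoder = nn.Linear(query_dim, decoder_dim)

    def forward(self, image_feats):
        # image_feats: (B, num_patches, image_dim) from the frozen encoder.
        B = image_feats.size(0)
        q = self.queries.expand(B, -1, -1)
        qn = self.norm1(q)
        q = q + self.self_attn(qn, qn, qn)[0]
        kv = self.img_proj(image_feats)
        q = q + self.cross_attn(self.norm2(q), kv, kv)[0]
        q = q + self.ffn(self.norm3(q))
        # (B, num_queries, decoder_dim): compact visual prefix for the decoder.
        return self.to_decoder(q)

# Example: fuse hypothetical frozen ViT patch features into a visual prefix.
fusion = CrossModalQueryFusionLayer()
feats = torch.randn(2, 196, 1024)  # assumed frozen-encoder output shape
prefix = fusion(feats)             # (2, 32, 768), fed to the text decoder
```

Because the image encoder stays frozen and only this small module (plus the decoder-side components) is trained, the two-step setup avoids the cost of end-to-end training while still learning the vision-language alignment described above.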