RoI-MedCap: Region of Interest-Based Medical Image Captioning with Multi-Stream Connector

Published: 19 Aug 2025, Last Modified: 12 Oct 2025 · BHI 2025 · CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Medical Image Captioning, Radiology Report Generation, Vision Language Model (VLM), Large Language Model (LLM), Region of Interest (RoI), Multi-Stream Connector, Cross Attention, Artificial Intelligence
TL;DR: This paper presents an architecture with a vision encoder, a novel Multi-Stream Connector, and an LLM to generate structured captions for medical images with integrated RoI, using an efficient training strategy that trains only the connector.
Abstract: Medical image captioning has gained significant attention due to rapid advances in Artificial Intelligence. However, existing research focuses primarily on global image captioning and lacks a mechanism for Region of Interest (RoI)-based captioning, in which users specify an area and receive a caption centered on that region. In this paper, we propose a novel architecture comprising a vision encoder, a connector, and a Large Language Model (LLM) to generate captions for medical images with an integrated RoI. We introduce a Multi-Stream Connector (MSC) that projects visual features from the vision encoder into a representation that helps the LLM generate captions centered on the region indicated by a bounding box. We aim to generate captions covering three aspects: the modality and structure, RoI analysis and lesion findings within the RoI, and the local-global relationship denoting the impact of RoI findings on other regions. To achieve this, the MSC incorporates three cross-attention modules, each focusing on one of these aspects of the generated captions. Our extensive experiments demonstrate that our method generates captions that align more closely with human judgement than existing related methods.
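The abstract describes the MSC as three cross-attention streams, one per caption aspect, projecting vision-encoder features into the LLM's embedding space. The following is a minimal NumPy sketch of that idea, not the paper's implementation: query sets, dimensions, and the concatenate-then-project fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # queries: (Tq, d); keys_values: (Tv, d) visual features
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ keys_values

class MultiStreamConnector:
    """Illustrative sketch: three cross-attention streams, one per caption
    aspect (modality/structure, RoI findings, local-global relationship),
    concatenated and linearly projected to the LLM embedding width.
    All parameters are random stand-ins for learned weights."""

    def __init__(self, d_vis, d_llm, n_query=8, seed=0):
        rng = np.random.default_rng(seed)
        # one learnable query set per stream (random here, trained in practice)
        self.queries = [rng.standard_normal((n_query, d_vis)) for _ in range(3)]
        # fusion projection into the LLM's token-embedding space
        self.proj = rng.standard_normal((3 * d_vis, d_llm)) / np.sqrt(3 * d_vis)

    def __call__(self, vis_feats):
        # vis_feats: (T, d_vis) patch features from the vision encoder
        streams = [cross_attention(q, vis_feats, vis_feats.shape[-1])
                   for q in self.queries]
        # each stream: (n_query, d_vis) -> concat (n_query, 3*d_vis) -> project
        return np.concatenate(streams, axis=-1) @ self.proj  # (n_query, d_llm)

msc = MultiStreamConnector(d_vis=32, d_llm=64)
tokens = msc(np.random.default_rng(1).standard_normal((50, 32)))
```

The resulting `tokens` would be fed to the LLM as soft prompt embeddings; in the paper's training strategy only this connector is trained, with the vision encoder and LLM kept frozen.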
Track: 3. Imaging Informatics
Registration Id: V7NTYPCFDLB
Submission Number: 178