Segmentation-Informed Captioning: A Multi-Stage Pipeline for Surgical Vision–Language Dataset Generation

Published: 01 May 2025, Last Modified: 01 Jun 2025, MIDL 2025 - Short Papers, CC BY 4.0
Keywords: Surgical Captioning, Surgical Scene Understanding, Vision-Language Models
TL;DR: We present a five-stage pipeline that uses segmentation masks to generate descriptive surgical captions, enabling the creation of high-quality vision-language datasets for training models for fine-grained surgical scene understanding.
Abstract: Developing models that understand surgical scenes across different procedures and tasks is critical for advancing generalizable surgical AI. Existing approaches often rely on vision-language models (VLMs), but their performance is limited by the quality of available datasets, which are often noisy or misaligned, especially those built from transcribed surgical audio. To address this, we propose a five-stage pipeline that constructs more accurate and less noisy vision-language datasets from existing segmentation datasets. Our method applies rule-based heuristics to extract spatial and interaction cues, which are then used to prompt a large language model (LLM) to produce natural-sounding, clinically coherent captions. In an evaluation by three medical experts of how well the captions met stage-specific expectations, 95% of the generated captions scored 3 or higher on a Likert scale.
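The abstract does not specify the pipeline's actual heuristics, so the following is only a minimal Python sketch of what rule-based spatial and interaction cue extraction from a segmentation mask, followed by LLM prompt assembly, could look like. The class IDs, class names, image-thirds position heuristic, and centroid-distance contact threshold are all hypothetical illustrations, not the paper's method.

```python
import numpy as np

# Hypothetical label map; the paper's actual class definitions are not given.
LABELS = {1: "grasper", 2: "scissors", 3: "gallbladder"}


def centroid(mask: np.ndarray, cls: int):
    """Return the (row, col) centroid of one class, or None if absent."""
    ys, xs = np.nonzero(mask == cls)
    if ys.size == 0:
        return None
    return float(ys.mean()), float(xs.mean())


def spatial_cues(mask: np.ndarray):
    """Heuristic spatial cues: coarse image-thirds position per structure."""
    h, w = mask.shape
    cues = []
    for cls, name in LABELS.items():
        c = centroid(mask, cls)
        if c is None:
            continue
        vert = ["upper", "middle", "lower"][min(int(c[0] / h * 3), 2)]
        horiz = ["left", "center", "right"][min(int(c[1] / w * 3), 2)]
        cues.append(f"{name} in the {vert}-{horiz} region")
    return cues


def interaction_cues(mask: np.ndarray, touch_px: float = 20.0):
    """Heuristic interaction cue: instrument centroid near the organ centroid.
    The threshold touch_px is an arbitrary illustrative value."""
    cues = []
    organ = centroid(mask, 3)
    if organ is None:
        return cues
    for cls in (1, 2):
        tool = centroid(mask, cls)
        if tool is None:
            continue
        if np.hypot(tool[0] - organ[0], tool[1] - organ[1]) < touch_px:
            cues.append(f"{LABELS[cls]} in contact with {LABELS[3]}")
    return cues


def build_prompt(mask: np.ndarray) -> str:
    """Assemble extracted cues into a caption-generation prompt for an LLM."""
    cues = spatial_cues(mask) + interaction_cues(mask)
    return (
        "Write one clinically coherent, natural-sounding caption of this "
        "surgical scene using only these observations:\n- " + "\n- ".join(cues)
    )


if __name__ == "__main__":
    # Toy mask: a gallbladder region with a grasper tip overlapping its center.
    demo = np.zeros((240, 320), dtype=np.uint8)
    demo[90:150, 60:140] = 3
    demo[110:130, 85:115] = 1
    print(build_prompt(demo))
```

In a full pipeline, a prompt like this would be sent to an LLM per frame; restricting the model to the extracted cues is one way to keep captions grounded in the segmentation evidence rather than hallucinated content.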
Submission Number: 77