Segmentation-Informed Captioning: A Multi-Stage Pipeline for Surgical Vision–Language Dataset Generation

Published: 01 May 2025, Last Modified: 01 Jun 2025, MIDL 2025 - Short Papers, CC BY 4.0
Keywords: Surgical Captioning, Surgical Scene Understanding, Vision-Language Models
TL;DR: We present a five-stage pipeline that uses segmentation masks to generate descriptive surgical captions, enabling the creation of high-quality vision-language datasets for training models for fine-grained surgical scene understanding.
Abstract: Developing models that understand surgical scenes across different procedures and tasks is critical for advancing generalizable surgical AI. Existing approaches often rely on vision-language models (VLMs), but their performance is limited by the quality of available datasets, which are often noisy or misaligned, especially those built from transcribed surgical audio. To address this, we propose a five-stage pipeline that constructs more accurate and less noisy vision-language datasets from existing segmentation datasets. Our method applies rule-based heuristics to extract spatial and interaction cues, which are then used to prompt a large language model (LLM) to produce natural-sounding, clinically coherent captions. In an evaluation by three medical experts of how well the captions met stage-specific expectations, 95% of the generated captions scored 3 or higher on a Likert scale.
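The abstract does not specify the pipeline's actual heuristics, so the following is only a minimal Python sketch of what rule-based spatial and interaction cue extraction from a segmentation mask, followed by LLM prompt assembly, could look like. The class IDs, class names, image-thirds position heuristic, and centroid-distance contact threshold are all hypothetical illustrations, not the paper's method.

```python
import numpy as np

# Hypothetical label map; the paper's actual class definitions are not given.
LABELS = {1: "grasper", 2: "scissors", 3: "gallbladder"}


def centroid(mask: np.ndarray, cls: int):
    """Return the (row, col) centroid of one class, or None if absent."""
    ys, xs = np.nonzero(mask == cls)
    if ys.size == 0:
        return None
    return float(ys.mean()), float(xs.mean())


def spatial_cues(mask: np.ndarray):
    """Heuristic spatial cues: coarse image-thirds position per structure."""
    h, w = mask.shape
    cues = []
    for cls, name in LABELS.items():
        c = centroid(mask, cls)
        if c is None:
            continue
        vert = ["upper", "middle", "lower"][min(int(c[0] / h * 3), 2)]
        horiz = ["left", "center", "right"][min(int(c[1] / w * 3), 2)]
        cues.append(f"{name} in the {vert}-{horiz} region")
    return cues


def interaction_cues(mask: np.ndarray, touch_px: float = 20.0):
    """Heuristic interaction cue: instrument centroid near the organ centroid.
    The threshold touch_px is an arbitrary illustrative value."""
    cues = []
    organ = centroid(mask, 3)
    if organ is None:
        return cues
    for cls in (1, 2):
        tool = centroid(mask, cls)
        if tool is None:
            continue
        if np.hypot(tool[0] - organ[0], tool[1] - organ[1]) < touch_px:
            cues.append(f"{LABELS[cls]} in contact with {LABELS[3]}")
    return cues


def build_prompt(mask: np.ndarray) -> str:
    """Assemble extracted cues into a caption-generation prompt for an LLM."""
    cues = spatial_cues(mask) + interaction_cues(mask)
    return (
        "Write one clinically coherent, natural-sounding caption of this "
        "surgical scene using only these observations:\n- " + "\n- ".join(cues)
    )


if __name__ == "__main__":
    # Toy mask: a gallbladder region with a grasper tip overlapping its center.
    demo = np.zeros((240, 320), dtype=np.uint8)
    demo[90:150, 60:140] = 3
    demo[110:130, 85:115] = 1
    print(build_prompt(demo))
```

In a full pipeline, a prompt like this would be sent to an LLM per frame; restricting the model to the extracted cues is one way to keep captions grounded in the segmentation evidence rather than hallucinated content.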
Submission Number: 77