Injecting Textual Spatial Context into Vision-Language Models for Surgical Scene Understanding

15 Apr 2026 (modified: 16 Apr 2026) · MIDL 2026 Short Papers Submission · CC BY 4.0
Keywords: Laparoscopic Surgery, Vision-Language Models, Spatial Awareness
TL;DR: This paper introduces SpatialContext, a simple way to inject anatomical scene geometry into vision-language models for laparoscopic multi-organ recognition.
Registration Requirement: Yes
Abstract: Accurate anatomical landmark identification is important for safe laparoscopic navigation, yet the limited field of view and strong visual similarity between tissues make multi-label organ classification difficult. Existing vision-language models rely mainly on appearance and overlook the spatial structure of surgical scenes (Zhang et al., 2025). We propose SpatialContext, a multimodal framework that injects scene geometry into classification through natural-language prompts derived from segmentation masks, together with a context-conditional training strategy centered on the primary surgical target. Results on DSAD (Carstens et al., 2023) and Endoscapes (Mascagni et al., 2025) show improved recognition of scene-defining and off-target anatomy, suggesting that explicit spatial semantics can improve surgical scene understanding.
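The abstract does not specify how the spatial prompts are constructed, so the following is only a minimal sketch of one plausible reading: each organ's centroid in the segmentation mask is mapped to a coarse thirds-of-frame phrase and assembled into a sentence, with the primary surgical target flagged explicitly. The `ORGAN_LABELS` mapping, the `primary_id` argument, and the phrase template are all hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

# Hypothetical label map; the actual DSAD/Endoscapes class lists differ.
ORGAN_LABELS = {1: "liver", 2: "gallbladder", 3: "stomach", 4: "colon"}

def coarse_position(cy, cx, h, w):
    """Map a centroid to a coarse spatial phrase (thirds of the frame)."""
    vert = ["upper", "middle", "lower"][min(int(3 * cy / h), 2)]
    horiz = ["left", "center", "right"][min(int(3 * cx / w), 2)]
    return f"{vert} {horiz}"

def spatial_context_prompt(mask, primary_id):
    """Build a textual spatial-context prompt from a segmentation mask.

    mask: (H, W) integer array of per-pixel organ labels (0 = background).
    primary_id: label of the primary surgical target that the prompt
    is conditioned on (assumed, per the context-conditional strategy).
    """
    h, w = mask.shape
    phrases = []
    for organ_id, name in ORGAN_LABELS.items():
        ys, xs = np.nonzero(mask == organ_id)
        if ys.size == 0:
            continue  # organ not visible in this frame
        pos = coarse_position(ys.mean(), xs.mean(), h, w)
        role = "the primary target" if organ_id == primary_id else "visible"
        phrases.append(f"the {name} ({role}) is in the {pos} of the view")
    return "In this laparoscopic scene, " + "; ".join(phrases) + "."
```

The resulting string would be fed to the vision-language model's text encoder alongside the frame. Coarse thirds-of-frame wording keeps the prompt short enough for a CLIP-style token budget; finer relational phrases (e.g., "anterior to the gallbladder") would be a natural extension, though the paper does not state which granularity is used.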
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 109