Decoupling Vision and Reasoning: A Data-Efficient Pipeline for Surgical VQA

30 Nov 2025 (modified: 15 Dec 2025) · MIDL 2026 Conference Submission · CC BY 4.0
Keywords: Surgical VQA, Modular Vision-Language Models, Vision Language Models, Multi-modal Reasoning
Abstract: Vision-language models (VLMs) are becoming increasingly important for surgical intelligence, where reliable scene understanding requires combining visual perception with language-based reasoning. However, progress is constrained by the scarcity of high-quality multimodal datasets, which makes end-to-end training prone to overfitting. Existing approaches often address this limitation by converting task-specific datasets (e.g., segmentation, phase recognition, tool-tissue interaction) into synthetic visual question answering (VQA) form, but such conversions provide only sparse supervision and limit generalization. To overcome these challenges, we propose a modular pipeline that decouples visual information extraction from reasoning. Specialist surgical models, proven effective for their corresponding vision tasks, are first used to extract task-relevant signals, which are then transformed via heuristics into structured textual descriptions. These descriptions, together with the clinical question, are passed to a large language model (LLM) that performs the reasoning step and provides the answer. We evaluate this pipeline on the EndoVis-18-VQA benchmark under different configurations of specialist models and LLMs, showing that combining complementary experts yields stronger performance than relying on any single model. Our approach achieves higher accuracy, recall, and F1 than existing surgical VQA baselines, with improvements of up to 2.3% in accuracy without requiring multimodal training, establishing abstraction-driven modularity as a data-efficient and generalizable paradigm for surgical vision-language understanding.
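To make the decoupled design concrete, below is a minimal Python sketch of the kind of pipeline the abstract describes: specialist vision models produce task-specific outputs, a heuristic serializer turns them into a structured textual scene description, and a text-only LLM answers the clinical question from that description. All names, the prompt template, and the stub specialists are illustrative placeholders under our own assumptions, not the authors' implementation.

```python
"""Minimal sketch of a decoupled surgical-VQA pipeline (hypothetical names throughout)."""

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SceneDescription:
    """Structured textual abstraction built from specialist-model outputs."""
    tools: List[str]          # e.g. from a tool-detection specialist
    phase: str                # e.g. from a phase-recognition specialist
    interactions: List[str]   # e.g. from a tool-tissue interaction specialist


def build_prompt(desc: SceneDescription, question: str) -> str:
    """Heuristically serialize the vision signals into a text prompt for the LLM."""
    return (
        "Surgical scene description:\n"
        f"- Instruments present: {', '.join(desc.tools) or 'none detected'}\n"
        f"- Surgical phase: {desc.phase}\n"
        f"- Tool-tissue interactions: {', '.join(desc.interactions) or 'none detected'}\n\n"
        f"Question: {question}\nAnswer concisely."
    )


def answer_question(
    frame,
    question: str,
    specialists: Dict[str, Callable],      # task name -> specialist model callable
    llm_generate: Callable[[str], str],    # any text-only LLM interface
) -> str:
    """Decoupled VQA: vision extraction -> textual abstraction -> LLM reasoning."""
    desc = SceneDescription(
        tools=specialists["tools"](frame),
        phase=specialists["phase"](frame),
        interactions=specialists["interactions"](frame),
    )
    return llm_generate(build_prompt(desc, question))


if __name__ == "__main__":
    # Stub specialists and a stub LLM, standing in for real pretrained models.
    specialists = {
        "tools": lambda f: ["bipolar forceps", "monopolar curved scissors"],
        "phase": lambda f: "tissue dissection",
        "interactions": lambda f: ["forceps grasping kidney tissue"],
    }
    echo_llm = lambda prompt: f"[LLM answer based on]\n{prompt}"
    print(answer_question(frame=None, question="Which instruments are in use?",
                          specialists=specialists, llm_generate=echo_llm))
```

Passing the specialists and the LLM in as callables mirrors the modularity claim: individual experts or the reasoning model can be swapped without retraining any multimodal component.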
Primary Subject Area: Detection and Diagnosis
Secondary Subject Area: Interpretability and Explainable AI
Registration Requirement: Yes
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 164