Decoupling Vision and Reasoning: A Data-Efficient Pipeline for Surgical VQA

Mohamed Hamdy; Fatimaelzahraa Ali Ahmed; Muraam Abdel-Ghani; Muhammad Arsalan; Ponnuthurai Nagaratnam Suganthan; Khalid Al-Jalham; Abdulaziz Al-Ali; Shidin Balakrishnan

Published: 14 Feb 2026, Last Modified: 14 Feb 2026

Venue: MIDL 2026 Poster

License: CC BY 4.0

Keywords: Surgical VQA, Modular Vision-Language Models, Vision Language Models, Multi-modal Reasoning

Abstract: Vision-language models (VLMs) are becoming increasingly important for surgical intelligence, where reliable scene understanding requires combining visual perception with language-based reasoning. However, progress is constrained by the scarcity of high-quality multimodal datasets, which makes end-to-end training prone to overfitting. Existing approaches often address this limitation by converting task-specific datasets (e.g., segmentation, phase recognition, tool-tissue interaction) into synthetic visual question answering (VQA) form, but such conversions provide only sparse supervision and limit generalization. To overcome these challenges, we propose a modular pipeline that decouples visual information extraction from reasoning. Specialist surgical models, each proven effective for its corresponding vision task, first extract task-relevant signals, which heuristics then transform into structured textual descriptions. These descriptions, together with the clinical question, are passed to a large language model (LLM) that performs the reasoning step and produces the answer. We evaluate this pipeline on the EndoVis-18-VQA benchmark under different configurations of specialist models and LLMs, showing that combining complementary experts yields stronger performance than relying on any single model. Our approach achieves higher accuracy, recall, and F1 than existing surgical VQA baselines, with improvements of up to 2.3% in accuracy, without requiring multimodal training. These results establish abstraction-driven modularity as a data-efficient and generalizable paradigm for surgical vision-language understanding.
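The decoupled flow described in the abstract (specialist outputs, then heuristic text, then LLM reasoning) can be illustrated with a minimal Python sketch. The abstract does not specify the specialist models, the heuristics, or the prompt format, so everything here is an assumption made for illustration: the `ToolDetection` and `SceneSignals` containers, the quadrant heuristic in `region`, the prompt wording in `build_prompt`, and the `llm` callable (any function mapping a prompt string to a reply string) are all hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToolDetection:
    """Hypothetical output of a specialist tool detector or segmenter."""
    name: str
    x_center: float  # normalized to [0, 1]; 0 is the left edge
    y_center: float  # normalized to [0, 1]; 0 is the top edge


@dataclass
class SceneSignals:
    """Hypothetical container pooling signals from several specialists."""
    tools: List[ToolDetection] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)


def region(x: float, y: float) -> str:
    """Assumed heuristic: map a normalized centre point to a coarse quadrant."""
    vertical = "top" if y < 0.5 else "bottom"
    horizontal = "left" if x < 0.5 else "right"
    return f"{vertical}-{horizontal}"


def describe_scene(signals: SceneSignals) -> str:
    """Turn specialist signals into a structured textual description."""
    lines = []
    for tool in signals.tools:
        where = region(tool.x_center, tool.y_center)
        lines.append(f"- Tool: {tool.name}, located {where}.")
    for interaction in signals.interactions:
        lines.append(f"- Interaction: {interaction}.")
    return "\n".join(lines) if lines else "- No instruments detected."


def build_prompt(description: str, question: str) -> str:
    """Combine the scene description with the clinical question."""
    return (
        "Scene description:\n"
        f"{description}\n\n"
        f"Question: {question}\n"
        "Answer with a short phrase."
    )


def answer_question(
    signals: SceneSignals, question: str, llm: Callable[[str], str]
) -> str:
    """Run the text-only reasoning step; `llm` is any prompt-to-reply callable."""
    return llm(build_prompt(describe_scene(signals), question)).strip()
```

In this sketch the LLM never sees pixels, only the text produced by `describe_scene`, which is why no multimodal training is involved. Swapping or adding a specialist only changes what populates `SceneSignals`; the reasoning step stays untouched.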

Primary Subject Area: Detection and Diagnosis

Secondary Subject Area: Interpretability and Explainable AI

Submission Number: 164
