Cognitive-Awakening Chain-of-Surgery for Compositional Zero-Shot Surgical Triplet Recognition

16 Sept 2025 (modified: 27 Jan 2026) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: Surgical Scene Understanding, Surgical Triplet Recognition, Compositional Zero-Shot Learning
Abstract: Compositional Zero-shot Surgical Triplet Recognition (CZSTR) is a challenging task that requires models to recognize novel combinations of <instrument, verb, target> whose components never co-occurred during training. This task captures the generalization requirement inherent in real surgical procedures. Large Vision-Language Models (LVLMs) with Chain-of-Thought (CoT) reasoning, among the most advanced methods, have limited exposure to surgical semantics and therefore fall short on the CZSTR task. To tackle this, we explore a more intuitive, human-like reasoning framework, which we call Cognitive-awakening Chain-of-Surgery (CoCoS). CoCoS mirrors the way surgeons think: it starts by glancing at the scene, then gazes at the operation process over time, and finally draws structured conclusions. This step-by-step cognitive-awakening process reflects how we naturally interpret surgical procedures and instructs LVLMs to understand surgical scenes deeply. Observing that LVLMs often hallucinate on relatively simple subtasks, e.g., identifying instruments, we further propose a Multimodal image–Sequence–Text (MiST) fusion module to reinforce the stability of the framework. To evaluate our framework, we also develop a strategy to reorganize existing surgical triplet datasets into a compositional zero-shot benchmark. Experiments show that our framework improves generalization to unseen triplets, outperforming both traditional models and LVLMs on this challenging task.
Supplementary Material: pdf
Primary Area: causal reasoning
Submission Number: 6546
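
The abstract describes reorganizing triplet datasets into a compositional zero-shot benchmark but does not give the procedure. The sketch below is only an illustration of the general idea, not the authors' strategy: it holds out a chosen set of triplets and checks that every held-out triplet is unseen as a combination while each of its components still appears in training. The function names, the sample layout (frame ID plus triplet), and the label values are all hypothetical.

```python
# Illustrative compositional zero-shot split (assumed procedure, not the paper's).
# A sample is (frame_id, (instrument, verb, target)); all labels are made up.

def split_compositional(samples, held_out):
    """Send every sample whose triplet is in `held_out` to test, the rest to train."""
    held_out = set(held_out)
    train = [s for s in samples if s[1] not in held_out]
    test = [s for s in samples if s[1] in held_out]
    return train, test

def is_valid_czs_split(train, test):
    """True if test triplets are unseen as combinations but seen component-wise."""
    seen = {triplet for _, triplet in train}
    unseen = {triplet for _, triplet in test}
    if seen & unseen:
        return False  # a test combination leaked into training
    instruments = {t[0] for t in seen}
    verbs = {t[1] for t in seen}
    targets = {t[2] for t in seen}
    return all(
        i in instruments and v in verbs and tg in targets
        for i, v, tg in unseen
    )

# Hypothetical toy data for demonstration only.
samples = [
    ("f1", ("grasper", "retract", "gallbladder")),
    ("f2", ("hook", "dissect", "cystic_duct")),
    ("f3", ("grasper", "dissect", "cystic_duct")),
    ("f4", ("hook", "retract", "gallbladder")),
    ("f5", ("grasper", "retract", "cystic_duct")),
]

train, test = split_compositional(samples, {("grasper", "dissect", "cystic_duct")})
print(is_valid_czs_split(train, test))
```

Holding out both triplets that contain "dissect" would fail the check, because the verb would then be absent from training and the task would no longer be purely compositional.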