Multiview Integration Network for Multitask Robotic Surgical Scene Analysis

Published: 01 Jan 2025; Last Modified: 23 Jul 2025. IEEE Trans. Instrum. Meas., 2025. License: CC BY-SA 4.0
Abstract: Surgical scene analysis plays a pivotal role in robot-assisted surgery. However, existing methods often rely on only a single view or a few views, which can lead to erroneous scene-analysis conclusions. To address this issue, a novel multiview integration network (MVINet) is proposed, which comprehensively analyzes the surgical scene by integrating effective information from multiple views: global, local, dynamic, and static. As a multitask scene-analysis network, MVINet can simultaneously perform semantic segmentation of surgical instruments and detection of instrument-tissue interactions. A dedicated global-local spatial feature memory module (G-LSFM) combines information from the globally and locally analyzed views through graph analysis, improving the accuracy of both tasks at once. The dynamic-static visual feature memory module (D-SVFM) incorporates both temporal features from consecutive frames and static features from a single frame into the node features of the interaction reasoning network. The multiscale perspective of both feature types further strengthens the module's ability to analyze complex scenes. Experimental results on a public and a private dataset demonstrate that our method outperforms state-of-the-art (SOTA) methods on two crucial surgical scene-analysis tasks. In instrument segmentation, MVINet outperforms the second-best method by 1.73% and 0.55% in mIoU. In interaction detection, MVINet surpasses the second-best method by 10.14% and 5.47% in mean average precision (mAP).
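The D-SVFM idea described above, combining temporal features from consecutive frames with static features from a single frame into graph node features, can be illustrated with a minimal sketch. All names, shapes, and the fusion scheme (temporal averaging plus concatenation) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def build_node_features(frames):
    """Hypothetical sketch of dynamic-static feature fusion.

    frames: array of shape (T, N, D) -- T consecutive frames,
    N scene nodes (instruments/tissue), D feature channels.
    """
    static = frames[-1]            # static features from the current frame
    dynamic = frames.mean(axis=0)  # dynamic features: simple temporal average
    # Concatenate per node to form (N, 2*D) node features for a
    # downstream interaction reasoning network.
    return np.concatenate([static, dynamic], axis=-1)

feats = build_node_features(np.random.rand(4, 3, 8))
print(feats.shape)  # (3, 16)
```

A real system would replace the temporal average with learned temporal modeling and operate at multiple feature scales, as the abstract indicates.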