Keywords: Minimally Invasive Surgery, Skull Base Surgery, Surgical Instrument Detection, Pose Estimation
TL;DR: We propose MID-POSE and the PitSurg dataset for multi-class surgical instrument detection and pose estimation in endoscopic surgery.
Abstract: Reliable perception of surgical instruments is a key prerequisite for intraoperative guidance, context-aware assistance, and workflow analysis in minimally invasive surgery (MIS). This is particularly challenging in skull base procedures, where narrow anatomical corridors, frequent occlusions, specular highlights, and visually similar instruments make multi-class detection and 2D pose estimation difficult. We address joint instrument detection and keypoint-based pose estimation from monocular endoscopic videos and introduce MID-POSE, a dual-head architecture that couples a high-resolution HRNetV2p encoder with a class-agnostic dense detection--pose head and a Multi-level Instrument Classification (MIC) head that operates on RoI-pooled multi-level features. To support this task, we construct the PitSurg dataset from 26 clinical procedures, providing seven instrument classes with bounding boxes and detailed 2D keypoints. Relative to YOLOv8x-pose, our strongest baseline (it outperforms YOLO11x-pose on our tasks), MID-POSE improves Det/Pose $\text{AP}_{50\text{--}95}$ on PitSurg from 59.4/63.1 to 77.5/78.5 and on the robotic SurgPose dataset from 47.9/61.1 to 62.7/71.4. Qualitative analysis shows that high-resolution features sharpen localisation and keypoint placement, while the RoI classifier reduces misclassifications and spurious background detections, indicating that the proposed architecture and dataset provide an effective basis for robust multi-instrument perception in MIS.
Primary Subject Area: Detection and Diagnosis
Secondary Subject Area: Application: Endoscopy
Registration Requirement: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 89