Abstract: Automated detection of surgical steps is important for improving patient safety and surgical decision-making. Current state-of-the-art methods often focus on frame-level prediction or iterative refinement, overlooking the essential task of predicting at the segment level. This oversight neglects the need to simultaneously localize steps in time and recognize their categories. In this paper, we present a novel multi-scale transformer-based approach to address this challenge. Our method classifies each frame in surgical videos and accurately estimates step boundaries. We extensively evaluate our approach on two large datasets of untrimmed cataract surgery videos, Cataract-101 and D99, and demonstrate superior performance compared to existing methods. Our results affirm the effectiveness of our approach for automated surgical step detection and recognition, emphasizing the importance of segment-level prediction for enhanced accuracy and practical application.
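The abstract does not specify the architecture in detail. As a rough illustration only, the following minimal PyTorch sketch shows one plausible shape of a multi-scale transformer that produces both per-frame step logits and per-frame boundary scores; all names (MultiScaleStepDetector, the choice of scales, head layout) are hypothetical assumptions, not the authors' method.

```python
# Hypothetical sketch (not the paper's released code): a multi-scale
# transformer encoder over per-frame features with two heads, one for
# per-frame step classification and one for step-boundary estimation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleStepDetector(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, n_heads=8,
                 n_layers=2, n_steps=10, scales=(1, 2, 4)):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # One transformer encoder per temporal scale; coarser scales see
        # downsampled sequences and capture longer-range step context.
        self.scales = scales
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True),
                num_layers=n_layers)
            for _ in scales
        ])
        fused_dim = d_model * len(scales)
        self.step_head = nn.Linear(fused_dim, n_steps)  # per-frame step logits
        self.boundary_head = nn.Linear(fused_dim, 1)    # per-frame boundary score

    def forward(self, feats):
        # feats: (batch, T, feat_dim) per-frame features, e.g. from a
        # frozen CNN backbone run over the video.
        x = self.proj(feats)
        T = x.size(1)
        outs = []
        for s, enc in zip(self.scales, self.encoders):
            # Downsample by average pooling, encode, then upsample back to T
            # so all scales can be fused frame by frame.
            xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s,
                              stride=s, ceil_mode=True).transpose(1, 2)
            hs = enc(xs)
            hs = F.interpolate(hs.transpose(1, 2), size=T, mode="linear",
                               align_corners=False).transpose(1, 2)
            outs.append(hs)
        h = torch.cat(outs, dim=-1)
        return self.step_head(h), self.boundary_head(h).squeeze(-1)


if __name__ == "__main__":
    model = MultiScaleStepDetector()
    feats = torch.randn(2, 240, 512)  # 2 clips, 240 frames each
    step_logits, boundary_scores = model(feats)
    print(step_logits.shape, boundary_scores.shape)  # (2, 240, 10), (2, 240)
```

At inference, one would threshold the boundary scores to segment the timeline and assign each segment the majority frame-level class, yielding segment-level predictions rather than isolated frame labels; how the paper actually decodes segments is not stated in the abstract.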