Surgical Flow Masked Autoencoder for Event Recognition

Mayar Lotfy Mostafa; Anna Alperovich; Dmitrii Fedotov; Ghazal Ghazaei; Stefan Saur; Azade Farshad; Nassir Navab

Published: 27 Mar 2025, Last Modified: 31 May 2025, Venue: MIDL 2025 Poster, License: CC BY 4.0

Keywords: Surgical Phase Recognition, Optical Flow, Masked Autoencoders, Adverse Events Recognition.

Abstract: Recognition and forecasting of surgical events from video sequences are crucial for advancing computer-assisted surgery. Surgical events are often characterized by specific tool-tissue interactions; for example, "bleeding damage" occurs when a tool unintentionally cuts a tissue, leading to blood flow. Despite progress in general event classification, recognizing and forecasting events in medical contexts remains challenging due to data scarcity and the complexity of these events. To address these challenges, we propose a method utilizing video masked autoencoders (VideoMAE) for surgical event recognition. This approach focuses the network on the most informative areas of the video while minimizing the need for extensive annotations. We introduce a novel mask sampling technique based on an estimated prior probability map derived from optical flow. We hypothesize that leveraging prior knowledge of tool-tissue interactions will enable the network to concentrate on the most relevant regions in the video. We propose two methods for estimating the prior probability map: (a) retaining areas with the fastest motion and (b) incorporating an additional encoding pathway for optical flow. Our extensive experiments on the public dataset CATARACTS and our in-house neurosurgical data demonstrate that optical flow-based masking consistently outperforms random masking strategies of VideoMAE in phase and event classification tasks. We find that an optical flow encoder enhances classification accuracy by directing the network's focus to the most relevant information, even in regions without rapid motion. Finally, we investigate sequential and multi-task training strategies to identify the best-performing model, which surpasses the current state-of-the-art by 5% on the CATARACTS dataset and 27% on our in-house neurosurgical data.
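
To make variant (a) of the abstract more concrete, the following is a minimal sketch of one way a flow-derived prior could drive mask sampling. It is not the authors' implementation: the function names `flow_prior_map` and `sample_mask`, the patch size, the mask ratio, the small `eps` floor, and the choice to sample visible patches in proportion to mean flow magnitude are all assumptions made for illustration. The optical flow field is assumed to be precomputed by some external estimator.

```python
import numpy as np


def flow_prior_map(flow, patch=16, eps=1e-6):
    """Turn a dense flow field of shape (H, W, 2) into a per-patch prior.

    Hypothetical helper: averages the flow magnitude inside each
    patch x patch cell and normalizes the result to sum to 1.
    """
    flow = np.asarray(flow, dtype=np.float64)
    mag = np.linalg.norm(flow, axis=-1)            # (H, W) motion magnitude
    h, w = mag.shape[0] // patch, mag.shape[1] // patch
    mag = mag[: h * patch, : w * patch]            # drop any ragged border
    per_patch = mag.reshape(h, patch, w, patch).mean(axis=(1, 3))
    per_patch = per_patch + eps                    # keep every patch samplable
    return per_patch / per_patch.sum()


def sample_mask(prior, mask_ratio=0.9, rng=None):
    """Sample a boolean mask (True = masked) from the prior.

    Assumption: patches with higher prior are more likely to stay
    visible, so fast-moving regions tend to be kept for the encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = prior.size
    n_visible = max(1, int(round(n * (1.0 - mask_ratio))))
    visible = rng.choice(n, size=n_visible, replace=False, p=prior.ravel())
    mask = np.ones(n, dtype=bool)
    mask[visible] = False
    return mask.reshape(prior.shape)
```

With a 64 x 64 flow field, a patch size of 16, and a mask ratio of 0.75, this yields a 4 x 4 mask with 12 masked and 4 visible patches, with the visible ones concentrated where motion is strongest. Variant (b), the learned optical-flow encoding pathway, is not covered by this sketch.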

Primary Subject Area: Application: Ophthalmology

Secondary Subject Area: Application: Other

Paper Type: Both

Submission Number: 237