\section{Introduction}

\begin{figure}[tb]
    \centering
    \includegraphics[trim=0 0 0 0, clip, width=0.7\linewidth]{assets/pipeline_MIDL_cameraready.drawio.png}
    \caption{Overview of the Flow Masked Autoencoder architecture. The model receives input frames $X_{T}$ and their corresponding optical flow frames $F_{T}$. The bottom left section shows the learned optical flow encoding mask, which is applied to the input before feeding the masked image into the encoder. Path (A) denotes the decoder head for reconstructing the input $X_{T}$, while Path (B) denotes the classification head for predicting surgical phases $Y_{T}$.}

    \label{fig:pipeline}
\end{figure}


Surgical workflow analysis provides insight into the sequence of events that make up a surgical procedure. Analyzing these events can help enhance surgical performance, optimize patient care, and improve training for medical professionals. Such analysis serves two main purposes: supporting intraoperative decision-making, by recognizing the current surgical phase and guiding timely assistance, and enabling retrospective analysis for education, quality control, and workflow optimization. However, automated surgical workflow recognition is not yet widely adopted in operating rooms because of challenges with robustness and reliability in complex surgical environments.


Operating room scenes are inherently intricate, often containing numerous irrelevant elements such as unused instruments, the surgeon’s hands, and holders that obscure the main regions of interest.
%
Prior studies on cataract surgery videos \cite{yu2019assessment} have demonstrated that leveraging tool information significantly enhances phase segmentation performance.
Similarly, DeepPhase \cite{zisimopoulos2018deepphase} highlights the importance of tool features in improving surgical workflow recognition. More recently, dynamic scene graphs have been employed to represent summaries of surgical scenes and specific tool-anatomy interactions for better surgical workflow recognition \cite{holm2023dynamic,koksal2024sangria}. However, these methods rely on highly detailed annotations of surgical scenes, which require substantial resources.
%
Building on these insights, we extend this line of work by focusing not only on tool features but also on the tool-tissue interactions that define surgical workflows. These interactions capture the core dynamics of a surgical procedure, which makes them particularly informative for workflow analysis.
Video Masked Autoencoder (VideoMAE)-based solutions have shown that masking strategies can effectively identify relevant regions in video data by learning robust spatiotemporal representations \cite{tong2022videomae}. However, existing approaches are largely generic and do not address the unique challenges posed by surgical videos, where key interactions are typically localized and obscured by irrelevant details.


In this work, we introduce \methodName{}, a novel optical flow-guided masking strategy that leverages tool-tissue interaction dynamics to enhance VideoMAE for surgical workflow analysis.
It uses optical flow information to identify and focus on regions with influential motion, such as tool-tissue interactions, while ignoring irrelevant areas.
%
Specifically, we develop two strategies for incorporating optical flow. In the first, we use its normalized magnitude directly to create a masking probability map. In the second, we add an encoding pathway for optical flow, which allows the model to learn the most relevant regions in the scene. In both cases, the estimated map serves as a prior probability for mask sampling in MAE. This encourages the masked autoencoder to extract meaningful features, improving downstream performance on tasks such as phase segmentation and adverse event classification.
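
One way to make the magnitude-based variant concrete is sketched below. The notation ($m_i$, $p_i$, $P_i$, $\epsilon$, $\rho$, $N$) is introduced here purely for illustration, and the choice to mask high-motion patches more often is an assumption rather than a detail fixed by the description above. Let $F_{T}(u)$ denote the optical flow vector at pixel $u$, and let $P_i$ be the set of pixels in the $i$-th of $N$ patches. A per-patch motion score and the resulting sampling prior could then be written as
\begin{equation}
    m_i = \frac{1}{|P_i|} \sum_{u \in P_i} \| F_{T}(u) \|_2 ,
    \qquad
    p_i = \frac{m_i + \epsilon}{\sum_{j=1}^{N} \left( m_j + \epsilon \right)} ,
\end{equation}
where a small $\epsilon > 0$ keeps every patch selectable when the scene is nearly static. Under this sketch, the masked set would be obtained by drawing $\lfloor \rho N \rfloor$ patch indices without replacement according to $p_i$, with $\rho$ denoting the masking ratio.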

We evaluate \methodName{} on two distinct surgical video datasets, highlighting its applicability across different surgical domains. \methodName{} achieves state-of-the-art (SOTA) performance on the task of phase segmentation on the CATARACTS dataset, outperforming methods that incorporate comprehensive surgical scene information through complex graph-based representations \cite{koksal2024sangria, holm2023dynamic}.
%
These evaluations demonstrate the generalizability and flexibility of our approach: it achieves up to 5\% improvement on the CATARACTS dataset and sets a new benchmark for adverse event classification in neurosurgery.



