\section{Experiments \& Results}

The foundational architecture of our model is derived from VideoMAE \cite{tong2022videomae}, specifically employing the ViT-Small backbone. We conduct experiments with two distinct variations of this model: the first is pretrained on the Kinetics-400 Action dataset, followed by fine-tuning on our specific use case dataset. The second variation involves training the reconstruction model from scratch.



\subsection{Datasets} 

In our experiments, we use three medical datasets: CATARACTS \cite{al2019cataracts} for fine-grained surgical activity recognition, an in-house neurosurgical dataset focusing on bleeding-related adverse events, and the EgoSurgery dataset \cite{fujii2024egosurgery} for general phases in egocentric open surgery.

\paragraph{CATARACTS}  
The dataset contains 50 cataract surgery videos, each recorded at \( 1920 \times 1080 \) pixels and 30 fps, annotated with 19 surgical phases. It is split into 25 training, 5 validation, and 20 test videos, consistent with prior work \cite{koksal2024sangria} for a fair comparison.
%


\paragraph{Microscopic Neurosurgery}
\label{Dataset:neuro}
The dataset for this study consists of 12 neurosurgical videos recorded at a resolution of \( 1920 \times 1080 \) pixels and a frame rate of 60 fps. Annotations focus on two classes: ``adverse bleeding event'' and ``non-adverse event''.
%
Bleeding is common in this surgical scene, particularly from the opening of the dura mater, so such instances are not classified as adverse events. Adverse events are annotated only when unintentional damage occurs due to tool-tissue interaction and necessitates immediate surgical intervention. This makes the task more complex than merely distinguishing between bleeding and non-bleeding scenarios. In our dataset, there are 205 occurrences of adverse events, while the remaining 1,673 sequences are categorized as normal events. The dataset is divided into training, validation, and test sets, comprising 70\%, 15\%, and 15\% of the total data, respectively. A patient-wise split is used to prevent data leakage: no patient appears in more than one subset, so the reported results reflect performance on unseen patients.
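
As a rough illustration of the class imbalance implied by these counts, and assuming that the 205 adverse events and the 1,673 normal events are counted in the same unit (sequences), the share of adverse sequences is approximately
\[
\frac{205}{205 + 1673} = \frac{205}{1878} \approx 10.9\%,
\]
that is, roughly one adverse sequence for every eight normal ones.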



\subsection{Experimental Setup}
\paragraph{Implementation Details}
We utilize the ViT-Small backbone with an input patch size of \( (16, 16) \) for all models. The input video and optical flow are processed at a resolution of \( 224 \times 224 \) pixels, comprising 16 frames with a sampling rate of 2.
For pre-training, in accordance with best practices established in prior research, the sampling ratio of input tokens is fixed at 90\%. 
Additional implementation details can be found in \autoref{appendix:implementation_details}. 
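
To make the input geometry concrete, a back-of-the-envelope calculation may help. It assumes non-overlapping patches and reads a sampling rate of 2 as keeping every second frame; both are our interpretation of the setup rather than details stated above. Under these assumptions, each frame yields
\[
N_{\mathrm{patch}} = \frac{224}{16} \times \frac{224}{16} = 14 \times 14 = 196
\]
patches, and one clip spans
\[
(16 - 1) \times 2 + 1 = 31
\]
consecutive source frames, i.e., roughly one second of footage when the source is sampled at 30 fps.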


\paragraph{Evaluation Metrics}
We assess our method using five benchmark metrics for surgical phase recognition and event classification: Accuracy (Acc1), Top-5 Accuracy (Acc5), Precision, Recall, and Jaccard index. We report micro-averaged accuracy for CATARACTS to compare with SOTA, while macro-averaging is used for the other metrics so that smaller classes carry equal weight.
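
For reference, the per-class metrics and their averages can be sketched as follows. These are the standard definitions, which we assume match the evaluation protocol; the symbols \( C \) (number of classes), \( N \) (number of evaluated samples), and \( \mathrm{TP}_c \), \( \mathrm{FP}_c \), \( \mathrm{FN}_c \) (true positives, false positives, and false negatives for class \( c \)) are introduced here for illustration only.
\[
\mathrm{Precision}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c}, \quad
\mathrm{Recall}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}, \quad
\mathrm{Jaccard}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}
\]
\[
M_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} M_c, \qquad
\mathrm{Acc}_{\mathrm{micro}} = \frac{1}{N} \sum_{c=1}^{C} \mathrm{TP}_c
\]
Here \( M_c \) stands for any of the per-class metrics. Macro-averaging weights every class equally regardless of its frequency, whereas micro-averaged accuracy is dominated by the most frequent classes.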


\subsection{Quantitative and Qualitative Results}
\paragraph{CATARACTS}


\input{tables/sota_cataracts}

\input{tables/masking_experiments_combined}


In \autoref{tab:sota_cataracts}, we compare recent methods and SOTA approaches for phase segmentation on the CATARACTS dataset. The evaluated methodologies include long-range temporal learning techniques using DINO-TCN++ \cite{koksal2024sangria}, which features a two-step model with DINO as a feature extractor followed by temporal classification. This is compared to our end-to-end variation that jointly trains Xception and TCN++.
Graph-based approaches from \cite{holm2023dynamic} and \cite{koksal2024sangria} examine static versus dynamic scene graphs. We also assess the masked encoder approach from \cite{tong2022videomae}, which serves as the foundation for our work. Our results show that the proposed \methodName{} achieves competitive performance, with the multitask model outperforming the SOTA by 5\% (4 points) in accuracy. 



In \autoref{tab:masking_experiments_combined}, we analyze results from various model configurations, distinguishing between two-step models that separately learn reconstruction and classification tasks and the multitask model. Here, `Mask' denotes the masking strategy used for training and testing the models. \textit{Flow} refers to masking based directly on the flow map input (\autoref{eq:1}), while \textit{Encoder} is the extended version, where masking is performed based on the estimated probability of the input flow maps (\autoref{eq:2}).
%
We assess the impact of masking types on both models trained from scratch and those fine-tuned from pretrained versions.
%
Our analysis reveals three key insights. First, fine-tuning models pretrained on a different domain (K400) benefits both setups.
%
Second, flow-based masking consistently outperforms random tube masking, which selects the same patches for all frames in a sequence. This results in about 8\% (5.90 points) improvement in accuracy (Acc1) for the K400+Cataracts configuration in the Rec+Cls task and a 2.5\% (2 points) enhancement in multitask mode. Third, the multitask approach consistently surpasses two-step models, showing that learning reconstruction aids phase recognition. Extended results are shown in \autoref{tab:all_experiments_cataracts}.

% 
\paragraph{Microscopic Neurosurgery}
Identifying adverse events in neurosurgery, where only surgical videos are available, poses significant challenges, as detailed in \autoref{Dataset:neuro}. In \autoref{fig:neuro_adverse_examples}, we illustrate that not all bleeding instances are damaging events.
%
While previous research has focused on bleeding detection in endonasal surgery \cite{pangal2022expert} and neurosurgical craniotomy \cite{tang2022bleeding}, we are, to the best of our knowledge, the first to introduce adverse event recognition in microscopic neurosurgical videos.
%
We utilize the multitask strategy for our analysis, as it proved most effective in the CATARACTS study presented in \autoref{tab:masking_experiments_combined}. Given that our dataset is approximately 2.5 times larger in duration than CATARACTS, we choose to train the model from scratch. This decision is supported by the ablation study shown in \autoref{tab:all_neuro_experiments}. As hypothesized, damage events are closely linked to tool-tissue interactions. This is supported by the substantial improvements obtained with flow masking, which targets regions of interest.
In \autoref{tab:neuro_multitask_results}, we present results from patient-wise cross-validation using three non-intersecting splits. All three splits achieved an accuracy above 79.4\%, with the best split reaching 89.5\%. Across all splits, the optical flow and encoder-based masking methods consistently outperformed the baseline, a trend that is also observed in the experiments on the EgoSurgery dataset, as reported in \autoref{tab:phase_recognition_egosurgery}.
\begin{figure}[tb]
  \centering
    \includegraphics[trim=0 160 190 0, clip, width=0.7\linewidth]{assets/reconstruction_output.jpg}
    \caption{Visualization of the reconstruction output from the multitask flow model, organized into four rows: input RGB, masked RGB, optical flow input, and reconstructed output. Examples are shown for both CATARACTS and Neurosurgery.}
    \label{fig:qual_eval_recons_output}
\end{figure}
We further conduct a qualitative assessment of the reconstruction output from the multitask model, as shown in \autoref{fig:qual_eval_recons_output}. The results indicate that masking 50\% of the image while retaining key regions based on optical flow enhances the model's ability to reconstruct contextual information and the interactions between tissue and the surgical tool.


\paragraph{Inference Analysis} \autoref{tab:inference_speed_analysis} shows the runtimes for each 16-frame video sequence. Although real-time operation is not our primary target, our neurosurgical videos are downsampled to 5 fps (200 ms between frames), so runtimes of 13--40 ms fit within the frame interval and are compatible with real-time use, provided hardware requirements are met.
\begin{table}[tb]
    \centering
    \caption{Inference Speed Analysis for Multitask Models}
    \resizebox{0.35\linewidth}{!}{ % Resize to 35% of the line width
        \begin{tabular}{ccc}
            \hline
            \textbf{Random} & \textbf{Flow} & \textbf{Encoder} \\ 
            \hline
            13 ms & 13.6 ms & 40 ms \\ 
            \hline
        \end{tabular}
    }
    \label{tab:inference_speed_analysis}
\end{table}
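
The real-time argument can be checked with simple arithmetic. As a sketch, assume that one inference is run per incoming frame on a sliding 16-frame window (an assumption on our part; the scheduling is not specified above). At 5 fps, the time budget per frame is
\[
\frac{1000\ \mathrm{ms}}{5} = 200\ \mathrm{ms},
\]
so the slowest configuration in \autoref{tab:inference_speed_analysis} uses
\[
\frac{40\ \mathrm{ms}}{200\ \mathrm{ms}} = 20\%
\]
of that budget, and the fastest uses \( 13 / 200 = 6.5\% \).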








