\section{Methodology}

Our proposed VideoMAE-based architecture has two parts: an autoencoder that reconstructs the input video frames and a downstream pathway that classifies surgical events. The two parts can be trained jointly or sequentially. The overall pipeline is illustrated in \autoref{fig:pipeline}.
%
First, we outline our \methodName{} methodology and introduce our novel masking strategy. Then, we describe the multitask approach and conclude with our objective functions.


\subsection{Definitions}

We are given a video sequence $V = \{X_1, \dots, X_T \}$ with event labels $Y = \{y_1, \dots, y_T \}$, where $T$ is the total number of frames in $V$, and $X_t \in \mathcal{R}^{H \times W \times C}$ is the video frame at time step $t$, with $H$, $W$, and $C$ denoting the frame's height, width, and number of channels. For each frame $X_t$, we estimate the magnitude of the optical flow, $F_t \in \mathcal{R}^{H \times W}$, and refer to it simply as optical flow from now on.
%
Our \methodName{} model, defined by $f_{\theta}$ and parameterized by $\theta$, receives pairs of video frames and optical flow and outputs the reconstructed video together with the predicted class label.

\begin{equation}
    V', Y' = f_{\theta}(V, F)
    \label{eq:1}
\end{equation}

%
\paragraph{Optical Flow Estimation}
The optical flow is precomputed using the SEA-RAFT algorithm \cite{wang2025sea} for each frame, providing a robust measure of motion and activity. We calculate it by analyzing frame differences over a temporal window of 1 second. Optical flow magnitude \( F \) is defined as $ F = \sqrt{u^{2}+v^{2}}$, where \( u \) and \( v \) represent the horizontal and vertical components, respectively. 
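
As a brief illustration of this definition (the pixel indices $(i,j)$ and the numbers below are chosen only for this example), the magnitude is evaluated independently at every pixel,
\[
    F_t(i,j) = \sqrt{u_t(i,j)^{2} + v_t(i,j)^{2}},
\]
so a pixel with flow components $u = 3$ and $v = -4$ has magnitude $F = \sqrt{9 + 16} = 5$. The direction of motion is discarded and only its strength is retained, which is why $F_t$ has a single channel.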

\subsection{Video Masked Autoencoder}
\label{sec:architecture}


\paragraph{Mask Sampling}
\label{sec: sampling_strategy}


\begin{figure}[tb]
    \centering
    \includegraphics[trim=5 120 90 0, clip, width=0.8\linewidth]{assets/MIDL_figure2.jpg}
    \caption{Visualization of sampling strategies.
Subfigure (A) provides an overview of the sampling techniques, displaying (from top to bottom) RGB images, optical flow representations, random tube masking, and flow masking. The left side shows CATARACTS data, while the right side shows neurosurgical data. Subfigure (B) illustrates the encoder's impact on the feature representation, with examples where significant features do not always correspond to the areas of highest motion. RGB images, optical flow, and encoded features are arranged in columns.}


    \label{fig:sampling_strategy}
\end{figure}



In conventional Masked Autoencoders, mask generation typically involves random patch selection. 
%
We hypothesize that regions with high motion, and hence large optical flow, carry important information for recognizing surgical actions.
%
Thus, we propose a new sampling strategy (\textit{cf.} \autoref{fig:sampling_strategy}) based on the estimated prior probability distribution \( P \). Whether the regions of frame \( X_{t} \) favored by the probability map are retained or removed depends on the training strategy (sequential or multitask).

%

Each frame \(X_t\) is divided into a set of non-overlapping patches \( B \). The sampling process draws \( k \) patches from \( B \) according to the probability distribution \( P \), i.e., $B_{k} \sim P$.
%
We compare two approaches for calculating the probability distribution. The first is derived directly from the optical flow and focuses on patches with higher motion dynamics. In the sequential strategy, these patches are preferentially masked: this increases the complexity of the reconstruction task and encourages the network to attend to the masked regions, enhancing the pretrained model's effectiveness. We normalize the optical flow by its maximum, $\left\| F \right\| =\frac{ F}{F_{max}+\epsilon}$, to bring its values into the $[0,1)$ range. The probability of selecting a patch to remain visible then decreases with its optical flow magnitude, $P = 1-\left\| F_{B} \right\|$.
In the multitask training strategy, we reverse the masking to keep the most informative parts visible, aligning the reconstruction task more closely with the downstream task; here the probability is proportional to the optical flow magnitude, $P = \left\| F_{B} \right\|$.
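
To make the two flow-based distributions concrete, the following sketch writes them as normalized categorical distributions over patches. It rests on assumptions that the description above does not fix: each patch $b \in B$ is summarized by its mean flow $\bar{F}_{b}$, and the weights are renormalized to sum to one. The symbols $\bar{F}_{b}$, $P_{\text{seq}}$, and $P_{\text{multi}}$ are introduced here only for illustration.
\[
\begin{aligned}
    \left\| F_{b} \right\| &= \frac{\bar{F}_{b}}{\max_{b' \in B} \bar{F}_{b'} + \epsilon}, \\
    P_{\text{seq}}(b) &= \frac{1 - \left\| F_{b} \right\|}{\sum_{b' \in B} \left( 1 - \left\| F_{b'} \right\| \right)}, \\
    P_{\text{multi}}(b) &= \frac{\left\| F_{b} \right\|}{\sum_{b' \in B} \left\| F_{b'} \right\|}.
\end{aligned}
\]
For example, three patches with mean flows $2$, $5$, and $10$ and a negligible $\epsilon$ have normalized values of approximately $0.2$, $0.5$, and $1.0$. Under these assumptions, $P_{\text{seq}} \approx (0.62, 0.38, 0)$, so the fastest-moving patch is essentially never drawn, whereas $P_{\text{multi}} \approx (0.12, 0.29, 0.59)$ draws it most often.
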
%
We observe in surgical videos that the surgeon's hands sometimes move more than the tools in the scene. However, hand movement does not contribute to understanding the current phase, so high flow dynamics alone might be a sub-optimal cue for classifying the event.
To address this concern, we extend the model $f_{\theta}$ (\autoref{eq:1}) with an additional encoder $e$ that estimates the probability distribution \( P \) for a set of patches \( B \) from the optical flow \( F_{B} \).

\begin{equation}
    V', Y' = f_{\theta} (V, e(F))
    \label{eq:2}
\end{equation}

To facilitate gradient flow, we concatenate the encoded optical flow with the input frames, $\left[ e(F_{t}), X_{t} \right]$, as an additional channel. Hence, the probability map is learned implicitly during the optimization of the model $f_{\theta}$.
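
In terms of tensor shapes, and assuming that the encoder $e$ preserves the spatial resolution of the flow map (an assumption made here for illustration), the concatenated input can be written as
\[
    \tilde{X}_{t} = \left[ e(F_{t}), X_{t} \right] \in \mathcal{R}^{H \times W \times (C+1)},
    \qquad e(F_{t}) \in \mathcal{R}^{H \times W}.
\]
For RGB frames with $C = 3$, the model would therefore receive four input channels per frame. The symbol $\tilde{X}_{t}$ is introduced only for this example.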
 

\paragraph{Multitask Model}
\label{sec: multitask_model}

The sequential approach is extended using a multitask training strategy, where tasks are learned jointly. This model has two distinct output heads: the reconstruction head operates as a decoder, and the classification head consists of linear layers (\autoref{fig:pipeline}-A and \autoref{fig:pipeline}-B). The reverse masking choice described previously is validated by comparisons in \autoref{tab:multi_masking_comparison}, visualized in \autoref{fig:masking_strategies}. The encoder-decoder path for reconstruction follows the VideoMAE framework \cite{tong2022videomae}. For classification, encoded features pass through a pooling layer before entering a classification head with two fully connected layers: the first has 256 hidden dimensions with ReLU activation and dropout, and the final layer produces class logits.
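
The classification pathway described above can be summarized as follows. This is a sketch under assumptions: $z$ denotes the encoded features, $d$ their embedding dimension, $K$ the number of event classes, and $\operatorname{Pool}$ the pooling operator, whose exact type is not specified here. Applying dropout after the ReLU is likewise an assumed ordering, and none of these symbols appear elsewhere in the text.
\[
\begin{aligned}
    h &= \operatorname{Pool}(z) \in \mathcal{R}^{d}, \\
    a &= \operatorname{Dropout}\left( \operatorname{ReLU}(W_{1} h + b_{1}) \right), \quad W_{1} \in \mathcal{R}^{256 \times d}, \\
    \hat{y} &= W_{2}\, a + b_{2}, \quad W_{2} \in \mathcal{R}^{K \times 256},
\end{aligned}
\]
where $\hat{y} \in \mathcal{R}^{K}$ are the class logits.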
 

\subsection{Objective Functions}
\label{sec:loss_functions}


In the multitask learning strategy, the overall loss function is a weighted combination of two task-specific losses, while the two-step (sequential) training strategy employs each loss independently. The combined loss is
\begin{equation}
    L_{\text{total}} = \alpha \cdot L_{\text{rec}} + \gamma \cdot L_{\text{CE}},
\end{equation}
where $\alpha$ and $\gamma$ are the weighting terms for the reconstruction and classification losses, respectively. Mean Squared Error (MSE) is used for the reconstruction task, measuring the difference between the pixel values of the original and reconstructed frames. The classification loss is the cross-entropy loss, which compares the predicted class probabilities, each between 0 and 1, with the ground-truth event labels.
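
For reference, the two terms can be written out as below. This is a sketch with assumed notation: $\Omega$ is the set of pixel positions over which the reconstruction error is averaged (whether it covers all pixels or only the masked ones is not fixed by the description above), $X'_{t}$ is the reconstructed frame, $z_{t} \in \mathcal{R}^{K}$ are the class logits for $K$ event classes, and $N$ is the number of samples in a batch.
\[
\begin{aligned}
    L_{\text{rec}} &= \frac{1}{|\Omega|} \sum_{p \in \Omega} \left( X_{t}(p) - X'_{t}(p) \right)^{2}, \\
    L_{\text{CE}} &= -\frac{1}{N} \sum_{t=1}^{N} \log \frac{\exp\left( z_{t, y_{t}} \right)}{\sum_{k=1}^{K} \exp\left( z_{t,k} \right)}.
\end{aligned}
\]
The softmax inside $L_{\text{CE}}$ maps the logits to probabilities in $(0,1)$, so this term approaches zero only as the correct class receives probability close to one.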


