\section{Related Work}
Recent work on event prediction from video sequences focuses on fusing spatiotemporal information into features that are relevant for classification.
%
The Masked Autoencoder (MAE) \cite{tong2022videomae} is a popular choice for video understanding for three main reasons.
%
 First, it reconstructs missing parts using contextual information, improving comprehension of complex events across frames. 
%
Second, its transformer architecture facilitates robust representation learning in the image domain. 
%
Lastly, MAE enables unsupervised pre-training without labels, which reduces annotation effort, a crucial benefit for video data.
%

\paragraph{Event Recognition in the Wild}
%
\cite{mao2023masked} propose a motion-aware masking strategy (MAMP) for 3D human action recognition that predicts masked joint motion from spatiotemporal video sequences, adding semantic information to the masking process.
\cite{sun2023masked} learn video representations by reconstructing the motion of masked regions: rather than recovering appearance, the model recovers motion trajectories, using the semantics of masked objects inferred from visible patches.
\cite{bandara2023adamae} adapt the REINFORCE algorithm \cite{williams1992simple} to sample visible tokens from a categorical distribution. Their network is trained to maximize the expected reconstruction error through policy gradients, and it surpasses methods that sample from a fixed distribution.
\cite{huang2023mgmae} introduce a motion-guided masking strategy that uses optical flow to obtain a consistent masking volume. While their approach offers an online solution for flow masking, it is slower than the original VideoMAE \cite{tong2022videomae}.
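For reference, policy-gradient training of a token sampler, as used by \cite{bandara2023adamae}, rests on the generic score-function identity of REINFORCE \cite{williams1992simple}. The notation here is ours and is only a sketch, not the exact objective of \cite{bandara2023adamae}: $p_\theta(m)$ denotes a hypothetical categorical distribution over masks $m$ with parameters $\theta$, and $\mathcal{L}(m)$ denotes the reconstruction error obtained under mask $m$, treated as independent of $\theta$:
\begin{equation}
\nabla_\theta \, \mathbb{E}_{m \sim p_\theta}\!\left[ \mathcal{L}(m) \right]
= \mathbb{E}_{m \sim p_\theta}\!\left[ \mathcal{L}(m) \, \nabla_\theta \log p_\theta(m) \right].
\end{equation}
Ascending this gradient increases the probability of masks with a high reconstruction error, even though the sampling step itself is not differentiable.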

\paragraph{Surgical Workflow Recognition}

The early EndoNet by \cite{twinanda2016endonet} performs surgical phase classification and tool presence detection in a multi-task framework, and it outperforms single-task methods on the Cholec80 dataset. The authors show that incorporating the tool presence task enhances EndoNet's ability to learn more discriminative features.
SV-RCNet \cite{jin2017sv} combines a CNN and an RNN for workflow recognition on the Cholec80 dataset. By integrating ResNet \cite{he2016deep} and LSTM networks \cite{du2015hierarchical}, it learns visual and temporal features but requires significant resources for design optimization.
%
In contrast to LSTM-based methods, TeCNo \cite{czempiel2020tecno} employs full temporal resolution and large receptive fields for surgical phase prediction. It leverages causal, dilated convolutions for fast, online inference on entire video sequences.
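To illustrate why dilated convolutions yield large receptive fields, consider a stack of $L$ causal convolutional layers with kernel size $k$ in which layer $l \in \{0, \dots, L-1\}$ uses dilation $2^{l}$. This doubling schedule and the symbols $k$, $L$, and $R$ are assumptions made for illustration, not details taken from \cite{czempiel2020tecno}. The prediction at frame $t$ then depends only on the $R$ most recent frames $t-R+1, \dots, t$, where
\begin{equation}
R = 1 + (k-1) \sum_{l=0}^{L-1} 2^{l} = 1 + (k-1)\left(2^{L} - 1\right).
\end{equation}
The receptive field thus grows exponentially with depth, while causality keeps the model usable for online inference.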
%
Yi \etal \cite{yi2022not} explore various multi-stage architectures that combine pre-trained models for surgical phase recognition.
%
Trans-SVNet \cite{gao2021trans} uses a transformer architecture to fuse different embeddings and better capture spatiotemporal information.
%
Recent works \cite{basu2024focusmae,fujii2024egosurgery,jamal2023surgmae} enhance MAE with improved masking procedures, such as estimating high-information regions, deriving masks from gaze-capturing data, or sampling tokens from high-information spatiotemporal areas instead of using random masking.