\section{Method}
Our dataset comprises 27,776 unlabeled vertebra crops (see Section~\ref{sec:experimental_setup} for details). This is still small compared to large-scale datasets such as ImageNet \citep{deng2009imagenet} and Kinetics-700 \citep{carreira2019short}. Because transformer-based models typically require training on large datasets \citep{li2023transforming}, we apply transfer learning already at the pretraining stage. Specifically, we explore using pretrained weights from data-rich domains to boost self-supervised pretraining on our vertebra CT data. Initializing CT models with ImageNet weights appears suboptimal, since weights learned from 2D images cannot capture the 3D details present in CT volumes. We therefore focus on weights pretrained in the video domain. Videos are 3D spatiotemporal data and thus align more closely with the characteristics of CT volumes, potentially offering a more effective initialization.

Our video-CT domain adaptation method comprises three steps (see \figureref{fig:method_overview}). First, we take weights from self-supervised video MAE pretraining \citep{feichtenhofer2022masked} on the Kinetics-700 video dataset as our foundation model. Second, in the domain adaptation step, we adapt these weights to vertebra CT data by pretraining on an unlabeled vertebra dataset, which aligns the model with the CT domain. Finally, we finetune the adapted encoder on a labeled dataset for vertebra fracture classification.

% Method Overview Figure
\begin{figure}[!htb]
\floatconts
  {fig:method_overview}
  {\caption{Overview of self-supervised video-CT domain adaptation: 1) video MAE pretraining on the Kinetics-700 dataset; 1)--2) positional encoding cropping to initialize domain-specific vertebra CT pretraining; 2) domain-specific vertebra MAE pretraining on an unlabeled vertebra CT dataset; 3) downstream task finetuning.}}
  {\includegraphics[width=1.0\linewidth]{chapters_209/images_209/method_overview_gap_209.png}}
\end{figure}

\paragraph{Video MAE for CT Vertebra Data}
Our approach features two self-supervised pretraining stages, both using the video MAE method \citep{feichtenhofer2022masked}. To match the temporal dimension of the $16 \times 224 \times 224$ video format from the Kinetics-700 dataset, we select 16 equidistant sagittal slices from the original $96 \times 96 \times 96$ vertebra CT volume, yielding $16 \times 96 \times 96$ inputs. We use sagittal slices because of their diagnostic relevance for vertebral fracture detection.

\paragraph{Positional Encoding Cropping}
After matching the temporal dimension of the video format, the 3D convolutional patch embedding layer ($2 \times 16 \times 16$) from video pretraining can be reused in the vertebra CT pretraining stage. However, because the input sizes differ ($224 \times 224$ frames for video and $96 \times 96$ slices for CT), the input token counts differ as well (video: $8 \times 14 \times 14$ tokens, CT: $8 \times 6 \times 6$ tokens), which results in a shape mismatch of the positional encodings. To address this, we introduce ``positional encoding cropping'': we keep only the positional encodings of the central $96 \times 96$ pixels of the video frames, i.e., the central $6 \times 6$ token positions, and discard the outer encodings (red in \figureref{fig:method_overview}). This allows the positional encoding weights in domain-specific vertebra CT pretraining to be initialized directly from the video weights, similar to the approach presented by \citet{kim2023region}.
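
As a worked example of the shapes involved, the token grids follow from dividing each input dimension by the corresponding patch size:
\[
\frac{16}{2} \times \frac{224}{16} \times \frac{224}{16} = 8 \times 14 \times 14 \;\; \mbox{(video)}, \qquad
\frac{16}{2} \times \frac{96}{16} \times \frac{96}{16} = 8 \times 6 \times 6 \;\; \mbox{(CT)}.
\]
The cropping itself can then be sketched under two assumptions made here purely for illustration: the positional encoding is written as a tensor $E$ indexed by a zero-based token position $(t, i, j)$, and the crop is exactly centered. The spatial offset is then $(14 - 6)/2 = 4$ tokens on each side, giving
\[
E^{\mathrm{CT}}_{t,i,j} = E^{\mathrm{video}}_{t,\,i+4,\,j+4}, \qquad t \in \{0, \dots, 7\}, \quad i, j \in \{0, \dots, 5\}.
\]
The temporal index $t$ is left unchanged because both domains use 8 temporal token positions; the symbols $E^{\mathrm{CT}}$ and $E^{\mathrm{video}}$ are notation introduced for this sketch.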

\paragraph{Slice Sampling}
During vertebra pretraining and downstream task finetuning, we use different sampling strategies for training and evaluation. During training, we sample 16 uniformly spaced sagittal slices and, at each training step, apply the same random shift to all selected slice indices, which increases diversity and effectively serves as an augmentation. During evaluation, we keep the uniform sampling strategy but do not shift the slice indices randomly. Instead, we use all six possible $16 \times 96 \times 96$ samples, which together cover the entire vertebral volume, and average the predictions over these subvolumes to obtain a more robust and representative result. This procedure is inspired by multi-view testing in the video domain \citep{feichtenhofer2019slowfast}.
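
The sampling scheme can be sketched as follows, assuming zero-based slice indices and an integer shift $s$; the symbols $s$, $\mathcal{I}_s$, $f$, $x$, and $\hat{y}$ are notation introduced here for illustration only. With 96 sagittal slices and 16 slices per sample, the stride is $96 / 16 = 6$, so the slice indices for shift $s$ are
\[
\mathcal{I}_s = \{\, s + 6k : k = 0, \dots, 15 \,\}, \qquad s \in \{0, \dots, 5\}.
\]
The six index sets are pairwise disjoint and together contain all 96 slice indices, which is why six samples cover the entire volume. Under this reading, training draws a single $s$ at random per step, while evaluation uses all six shifts and averages the model outputs,
\[
\hat{y} = \frac{1}{6} \sum_{s=0}^{5} f\left(x_{\mathcal{I}_s}\right),
\]
where $f$ denotes the finetuned model and $x_{\mathcal{I}_s}$ the $16 \times 96 \times 96$ subvolume formed by the slices in $\mathcal{I}_s$. Whether the average is taken over logits or probabilities is not specified in the text; the expression applies to either choice.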
