\section{Experimental Setup}

\subsection{Hardware Specifications}
All training runs in this paper were performed on a single NVIDIA A40 GPU.

\subsection{Preprocessing Pipeline}\label{app:preprocessing_pipeline}
\figureref{fig:preprocessing_pipeline} presents an overview of the preprocessing pipeline, while \figureref{fig:3d_vertebra_crop_visualization} provides a detailed view of a single vertebra crop produced by the pipeline.

% Figure Preprocessing Pipeline
\begin{figure}[ht]
\floatconts
  {fig:preprocessing_pipeline}
  {\caption{Overview of the preprocessing pipeline: 1) input CT scan; 2) segmentation mask produced by the segmentation model; 3) cropping of a $96 \times 96 \times 96$ vertebra crop based on the spline running through the spine (orange)}}
  {\includegraphics[width=0.6\linewidth]{chapters_209/images_209/preprocessing_pipeline_209.png}}
\end{figure}

% Figure 3D Vertebra Crop Example
\begin{figure}[ht]
\floatconts
  {fig:3d_vertebra_crop_visualization}
  {\caption{Visualization of Centered $96 \times 96 \times 96$ Vertebra Crop: 1) 3D visualization 2) mid-sagittal slice 3) mid-axial slice 4) mid-coronal slice}}
  {\includegraphics[width=0.7\linewidth]{chapters_209/images_209/3d_vertebra_209.png}}
\end{figure}

\subsection{Unlabeled Vertebra Pretraining Dataset}
This appendix lists additional references related to the public datasets used in our study. These works, while not cited directly in the main text, provide important background and context for the datasets. Their inclusion here acknowledges their contribution to the field and offers readers further resources on the topic.
% general TCIA citation
\citep{clark2013cancer}
% CT COLONOGRAPHY publication citation
\citep{johnson2008accuracy}
% HNSCC-3DCT-RT publication citation
\citep{bejarano2019longitudinal}
% Verse other citations appreciated
\citep{liebl2021computed} \citep{loffler2020vertebral}

\subsection{Labeled Vertebra Downstream Task Dataset}
\tableref{tab:labeled_vertebra_dataset} summarizes the labeled dataset from Klinikum Rechts der Isar (Munich) \citep{foreman2024deep}. This dataset was used for training on the downstream task.

% Table Labeled Dataset
\begin{table}[h]
\centering
\begin{tabular}{|c|c|c|}
\hline
\multicolumn{1}{|c|}{} & \multicolumn{2}{c|}{\bfseries classes} \\ \cline{1-3}
\bfseries split & \bfseries no fracture & \bfseries fracture \\
\hline
\hline
training & 3,336 & 556 \\
\hline
validation & 947 & 211 \\
\hline
test & 1,022 & 173 \\
\hline
\textbf{TOTAL} & \textbf{5,305} & \textbf{940} \\
\hline
\end{tabular}
\caption{Labeled Vertebra Downstream Task Dataset}
\label{tab:labeled_vertebra_dataset}
\end{table}
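
The training split in \tableref{tab:labeled_vertebra_dataset} contains six non-fractured vertebrae per fractured one (3,336 vs.\ 556). As a minimal sketch of how such counts could be turned into weights for a class-weighted cross-entropy loss, the snippet below uses inverse-frequency weighting. The weighting formula and the names \texttt{inverse\_frequency\_weights} and \texttt{weighted\_cross\_entropy} are assumptions made for illustration; this appendix does not state the exact weighting scheme used.

```python
import math

# Training-split counts from the table: [no fracture, fracture].
TRAIN_COUNTS = [3336, 556]


def inverse_frequency_weights(counts):
    """Return w_c = N / (K * n_c) for N samples, K classes, n_c samples in class c.

    Assumption: inverse-frequency weighting; other schemes are equally plausible.
    """
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * n) for n in counts]


def weighted_cross_entropy(logits, label, weights):
    """Weighted cross-entropy for one sample: -w_y * log softmax(logits)[y]."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return -weights[label] * (logits[label] - log_norm)


weights = inverse_frequency_weights(TRAIN_COUNTS)
```

Under this scheme the fracture class receives six times the weight of the no-fracture class, mirroring the count ratio in the training split.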

% Implementation Details
\subsection{Implementation Details}

% Pretraining Settings
\paragraph{Self-supervised Domain Adaptation Pretraining}
We employ the video MAE pretraining method of \citet{feichtenhofer2022masked} for both video and domain-specific vertebra CT pretraining. The MAE uses a ViT-Large encoder \citep{dosovitskiy2020image} and a decoder of depth 4, with convolutional patch embeddings of size $2 \times 16 \times 16$ and a masking ratio of 0.8. Domain-specific vertebra CT pretraining is initialized with weights from Kinetics-700 pretraining, using positional encoding cropping. Training runs for 100 epochs with a batch size of 8, the AdamW optimizer, and a cosine learning rate scheduler (base learning rate $10^{-3}$). For preprocessing, HU values are clipped to the range $[-1000, 1000]$ and then scaled to $[0, 1]$. To match the RGB video format, the HU values are replicated across the three channels. The input volumes have a shape of $16 \times 96 \times 96$, following the described sampling technique.
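
The intensity preprocessing and the resulting token count can be sketched as follows. This is an illustrative NumPy sketch, not the training code: the function names \texttt{preprocess\_ct\_clip} and \texttt{num\_patch\_tokens}, the channel-first output layout, and the use of exactly three replicated channels (to mimic RGB) are assumptions.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 1000.0  # clipping window stated in the text
PATCH = (2, 16, 16)               # patch embedding size stated in the text


def preprocess_ct_clip(volume_hu):
    """Clip HU values, rescale to [0, 1], and replicate to three channels.

    volume_hu: array of shape (T, H, W), e.g. (16, 96, 96).
    Returns a float32 array of shape (3, T, H, W) (channel-first is an assumption).
    """
    clipped = np.clip(volume_hu.astype(np.float32), HU_MIN, HU_MAX)
    scaled = (clipped - HU_MIN) / (HU_MAX - HU_MIN)
    return np.repeat(scaled[np.newaxis, ...], 3, axis=0)


def num_patch_tokens(shape, patch=PATCH):
    """Number of non-overlapping patches for an input of shape (T, H, W)."""
    if any(s % p for s, p in zip(shape, patch)):
        raise ValueError("input shape must be divisible by the patch size")
    tokens = 1
    for s, p in zip(shape, patch):
        tokens *= s // p
    return tokens
```

For a $16 \times 96 \times 96$ input, \texttt{num\_patch\_tokens} gives $8 \cdot 6 \cdot 6 = 288$ tokens, of which roughly 20\% remain visible to the encoder at a masking ratio of 0.8.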

% Downstream Task Settings
\paragraph{Downstream Task Finetuning}
For the subsequent classification task, only the ViT-Large encoder of the MAE pretraining architecture is retained. Global average pooling is applied to the output tokens, followed by a linear layer for binary classification. Finetuning runs for 50 epochs with a batch size of 64, the Adam optimizer, and a cosine learning rate scheduler with warmup (base learning rate $10^{-5}$). The cross-entropy loss is class-weighted to handle class imbalance. Data preprocessing is the same as in pretraining. The labeled vertebra dataset is split into training, validation, and test sets at a 60\%/20\%/20\% ratio, with patient-level separation to prevent test leakage.
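
Patient-level separation means that all vertebra crops of one patient end up in the same set. A minimal sketch of such a split is given below; the function name \texttt{patient\_level\_split}, the per-crop patient identifiers, and the seeded shuffle are hypothetical and only illustrate the idea, not the exact procedure used to create our splits.

```python
import random
from collections import defaultdict


def patient_level_split(patient_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign sample indices to train/val/test so that no patient spans two sets.

    patient_ids: one (hypothetical) patient identifier per vertebra crop.
    Returns a dict mapping the split name to a sorted list of sample indices.
    """
    by_patient = defaultdict(list)
    for index, pid in enumerate(patient_ids):
        by_patient[pid].append(index)

    patients = sorted(by_patient)  # deterministic order before shuffling
    random.Random(seed).shuffle(patients)

    # The fractions apply to patients, not to individual crops.
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: sorted(i for pid in pids for i in by_patient[pid])
        for name, pids in groups.items()
    }
```

Because the ratio is applied to patients rather than to crops, the per-split vertebra counts need not match 60\%/20\%/20\% exactly.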

\newpage

\section{Results and Discussion}

\subsection{Ablation Study - Positional Encoding Cropping}

% Figure Ablation Study Pos Enc Cropping Loss Curves Comparison
\begin{figure}[ht]
\floatconts
  {fig:ablation_loss_comparison}
  {\caption{Ablation Study - Positional Encoding Cropping: Pretraining Loss Comparison}}
  {\includegraphics[width=0.7\linewidth]{chapters_209/images_209/pretraining_loss_curves_209.png}}
\end{figure}


\subsection{Method Comparison - Implementation Details Baseline Methods}\label{app:baselines_implementation}
This appendix details the implementation of the baseline pretraining methods in our study, adapted to our dataset setup. Following the dataset setup of our Video-CT MAE method (see Experimental Setup \ref{sec:experimental_setup}), we first pretrain on the unlabeled vertebra dataset and then finetune on the labeled dataset, using $96 \times 96 \times 96$ vertebra crops.

\subsubsection{Models Genesis}
\paragraph{Pretraining \citep{zhou2021models}}
To better align with our vertebra crops, we increased the input size from the original $64 \times 64 \times 32$ to $96 \times 96 \times 96$, while maintaining all other original settings. Pretraining was initialized with publicly available pretrained weights.\footnote{\url{https://github.com/MrGiovanni/ModelsGenesis/tree/master}}

\paragraph{Finetuning}
We kept only the encoder of the pretrained model and appended a two-layer binary classification head.


\subsubsection{ViT UNETR}
\paragraph{Pretraining}
We followed the official public implementation and initialized pretraining with publicly available pretrained weights.\footnote{\url{https://github.com/Project-MONAI/tutorials/tree/main/self_supervised_pretraining}}

\paragraph{Finetuning}
We used the pretrained ViT-Base encoder for the downstream task by appending a classification token and integrating a classification head on top of this token.


\subsubsection{Swin UNETR}
\paragraph{Pretraining \citep{tang2022self}}
We followed the official implementation. The Swin encoder was initialized with publicly available weights, whereas the decoder was randomly initialized due to a shape mismatch.\footnote{\url{https://github.com/Project-MONAI/research-contributions/tree/main/SwinUNETR/Pretrain}}

\paragraph{Finetuning}
We used only the Swin ViT encoder. Global average pooling was applied to the output tokens, followed by a binary classification head.
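
A pooling-based classification head of this kind can be sketched as follows. The sketch assumes a single linear layer on top of the pooled tokens and uses the hypothetical name \texttt{gap\_classification\_head}; the exact head architecture used for this baseline is not specified beyond the description above.

```python
import numpy as np


def gap_classification_head(tokens, weight, bias):
    """Mean-pool encoder tokens and apply a linear layer.

    tokens: (batch, num_tokens, dim) encoder outputs.
    weight: (dim, num_classes); bias: (num_classes,).
    Returns logits of shape (batch, num_classes).
    Assumption: the head is a single linear layer.
    """
    pooled = tokens.mean(axis=1)  # global average pooling over the token axis
    return pooled @ weight + bias
```

In contrast to a head attached to a classification token, this variant needs no extra token and uses information from all output positions equally.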


\subsubsection{MAE}
\paragraph{Pretraining \citep{he2022masked}}
We adapted this method to our 3D CT data with minimal changes: the patch embedding layer uses 3D instead of 2D convolutions, and the positional encoding is extended to 3D. Apart from these adaptations, we kept the original settings, using a ViT-Base encoder and a masking ratio of 0.75.\footnote{\url{https://github.com/facebookresearch/mae}}
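
To illustrate the masking step on 3D patch tokens, the following NumPy sketch applies per-sample random masking by sorting random noise, in the spirit of the original MAE implementation. The function name \texttt{random\_masking} and the example of $16 \times 16 \times 16$ patches are assumptions for illustration; the 3D patch size used for this baseline is not stated here.

```python
import numpy as np


def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of tokens per sample.

    tokens: (batch, num_tokens, dim) patch embeddings.
    Returns (visible, mask): visible has shape (batch, num_keep, dim);
    mask has shape (batch, num_tokens) with 1 = masked and 0 = visible.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, num_tokens, _ = tokens.shape
    num_keep = int(num_tokens * (1 - mask_ratio))

    noise = rng.random((batch, num_tokens))
    keep_idx = np.argsort(noise, axis=1)[:, :num_keep]  # lowest noise is kept

    visible = np.take_along_axis(tokens, keep_idx[:, :, None], axis=1)
    mask = np.ones((batch, num_tokens), dtype=np.int64)
    np.put_along_axis(mask, keep_idx, 0, axis=1)
    return visible, mask
```

With hypothetical $16 \times 16 \times 16$ patches, a $96 \times 96 \times 96$ crop yields $6 \cdot 6 \cdot 6 = 216$ tokens, so a masking ratio of 0.75 leaves 54 visible tokens per sample.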

\paragraph{Finetuning}
For finetuning, we used only the ViT-Base encoder, with a binary classification head attached to the classification token.

