\section{Results and Discussion}
\subsection{Ablation Study}
This section presents our ablation study, which analyzes the impact of video transfer learning and self-supervised domain adaptation on the downstream vertebra classification task. Table~\ref{tab:ablation_study} reports the contribution of each key component of our method.

% Ablation Study Results Overview
\begin{table}[ht]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{|l|c|c|c|c|c|}
 \hline
 \bfseries Ablated component & \bfseries F1 (\%) & \bfseries ACC (\%) & \bfseries AUC (\%) & \bfseries AP (\%) & \bfseries FT Min/Ep\tablefootnote{Finetuning Minutes/Epoch} \\ 
 \hline
 \hline
 1) $-$ Video Pretraining & 74.6 & 92.1 & 93.1 & 82.6 & 9 \\
 2) $-$ Vertebra Pretraining & 85.7 & 95.5 & 98.0 & 92.5 & 9 \\
 3) $-$ Multi-View Sampling & 83.7 & 94.6 & 96.5 & 90.1 & 9 \\
 4) $-$ Positional Encoding Cropping & 87.1 & 96.2 & 97.7 & \textbf{93.3} & 9 \\
 5) $-$ Vertebra Format Adaptation & 87.5 & 96.0 & 97.5 & 90.7 & 18 \\
 6) $-$ 2) and 4) & 69.1 & 90.9 & 91.8 & 76.5 & 9 \\
 \hline
  \textbf{Video-CT MAE} & \textbf{88.4} & \textbf{96.4} & \textbf{98.2} & 93.2 & 9 \\
 \hline
\end{tabular}
}
\caption{Ablation study. 1) No video pretraining; 2) no domain-specific vertebra CT pretraining; 3) no multi-view sampling during inference; 4) no positional encoding cropping, i.e., randomly initialized positional encodings for vertebra CT pretraining; 5) original $16 \times 224 \times 224$ video format, obtained by padding the $96 \times 96$ slices; 6) ablations 2) and 4) combined.}
\label{tab:ablation_study}
\end{table}

\paragraph{Video Pretraining}
Removing video pretraining causes the largest drop among the single-component ablations: F1 falls from 88.4\% to 74.6\% and AP from 93.2\% to 82.6\%. This emphasizes the importance of initializing our domain-specific vertebra CT pretraining with video-pretrained weights, which provide a solid foundation and enable an effective transfer of learned features from the video domain to the CT domain.

\paragraph{Vertebra Pretraining}
The results demonstrate that video pretraining can be effectively adapted to the CT domain with proper format adjustments, even without domain adaptation. Video pretraining alone (ablation 2, 85.7\% F1) outperforms vertebra-only pretraining (ablation 1, 74.6\% F1), in line with the findings of \citet{ke2023video} and \citet{rajpurkar2020appendixnet}. However, skipping the vertebra pretraining step still yields lower performance than the full Video-CT MAE pipeline (88.4\% F1).

\paragraph{Multi-View Sampling}
Multi-view sampling improves prediction robustness: combining different views of the vertebra leads to a more reliable final prediction, and removing it lowers F1 from 88.4\% to 83.7\%.
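
As an illustrative sketch only, and assuming that the views are combined by simple averaging of the per-view class probabilities (the aggregation rule is not specified here, so this is an assumption), the final prediction for a vertebra $x$ with $V$ sampled views $x^{(1)}, \dots, x^{(V)}$ would be
\[
\hat{p}(y \mid x) = \frac{1}{V} \sum_{v=1}^{V} p_\theta\bigl(y \mid x^{(v)}\bigr),
\]
where $p_\theta$ denotes the finetuned classifier and $V$ is a hypothetical number of views. When the per-view errors are not perfectly correlated, such averaging typically reduces the variance of the prediction relative to a single view.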

\paragraph{Positional Encoding Cropping}
With randomly initialized positional encodings, performance remains close to that of our full Video-CT MAE method. However, the pretraining loss curves show that positional encoding cropping reduces the number of training epochs required (see \figureref{fig:ablation_loss_comparison}). Moreover, using the video weights directly for the downstream task benefits substantially from positional encoding cropping: F1 drops from 85.7\% (ablation 2) to 69.1\% when cropping is removed as well (ablation 6).
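
One way to formalize positional encoding cropping, given here as a sketch under stated assumptions rather than as the exact procedure, is the following. Let $E \in \mathbb{R}^{T \times G \times G \times D}$ denote the positional encodings of the video-pretrained model on a $G \times G$ spatial token grid, and let $g < G$ be the grid size of the smaller vertebra input. The symbols $T$, $G$, $g$, $D$ and the offsets $a$, $b$ are notation introduced for this sketch only. Cropping then selects a contiguous $g \times g$ window,
\[
\tilde{E}_{t,i,j} = E_{t,\,a+i,\,b+j}, \qquad 0 \le i, j < g, \quad 0 \le a, b \le G - g,
\]
so that neighbouring tokens keep the relative spatial encodings learned during video pretraining, whereas random initialization discards this structure.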

\paragraph{Vertebra Format Adaptation}
Using the video model with its original $224 \times 224$ frame size from pretraining yields performance comparable to our full Video-CT MAE method. However, this approach doubles the finetuning time from 9 to 18 minutes per epoch and also increases the pretraining time.
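
A rough calculation illustrates where the additional cost comes from. Assuming a square patch size $p$ that divides both resolutions (for example $p = 16$, a hypothetical value that is not stated above), the number of patch tokens per slice is
\[
N_{224} = \left(\frac{224}{p}\right)^{2}, \qquad N_{96} = \left(\frac{96}{p}\right)^{2}, \qquad \frac{N_{224}}{N_{96}} = \left(\frac{224}{96}\right)^{2} = \frac{49}{9} \approx 5.4 .
\]
For $p = 16$ this corresponds to $196$ versus $36$ tokens per slice. About $1 - 9/49 \approx 82\%$ of each padded slice is padding, so the padded format processes roughly five times as many tokens without adding image content, which is consistent in direction with the longer finetuning time in Table~\ref{tab:ablation_study}.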

\subsection{Self-supervised Domain Adaptation}
In this section, we compare our Video-CT MAE approach with established self-supervised pretraining methods applied to our vertebra data setup: Models Genesis \citep{zhou2021models}, ViT UNETR \citep{tang2022self}, Swin UNETR \citep{tang2022self}, and MAE \citep{he2022masked}. We study the importance of task-specific pretraining in three settings: first, random initialization for downstream task finetuning; second, finetuning from publicly available weights; and third, task-specific self-supervised pretraining initialized with the public weights.

% Method Comparison Results Overview
\begin{table}[ht]
\centering
\resizebox{\textwidth}{!}{\begin{tabular}{|c|c|c|c|c|c|} 
 \hline
 \bfseries Method & \bfseries Pretraining Data & \bfseries F1 (\%) & \bfseries ACC (\%) & \bfseries AUC (\%) & \bfseries AP (\%) \\ 
 \hline
 \hline
 Models Genesis 3D & - & 85.8 & 95.6 & 97.8 & 92.1 \\
 Models Genesis 3D & 623 CT images\tablefootnote{Models Genesis 3D pretraining dataset: LUNA16 \citep{setio2017validation}} (public) & 85.9 & 95.7 & \bfseries 98.1 & 92.3 \\
 \bfseries Models Genesis 3D & \bfseries public $\rightarrow$ vertebrae & \bfseries 87.1 & \bfseries 96.1 & 97.8 & \bfseries 92.9 \\
 \hline
 ViT UNETR & - & 34.2 & 56.4 & 65.2 & 22.2 \\
 ViT UNETR & 771 CT images\tablefootnote{ViT UNETR pretraining dataset: TCIA-Covid19 \citep{an2020ct}} (public) & 55.9 & 86.8 & 83.7 & 64.3 \\
 \bfseries ViT UNETR & \bfseries public $\rightarrow$ vertebrae & \bfseries 73.6 & \bfseries 91.6 & \bfseries 94.0 & \bfseries 84.1 \\
 \hline
 Swin UNETR & - & 36.2 & 61.7 & 69.6 & 26.9 \\
 Swin UNETR & 5,050 CT images\tablefootnote{Swin UNETR pretraining datasets: TCIA-Covid19 \citep{an2020ct}, LUNA16 \citep{setio2017validation}, HNSCC \citep{grossberg2020md}, LIDC \citep{armato2011lung}, TCIA Colon \citep{johnson2008accuracy}} (public) & 57.0 & 86.9 & 83.3 & 65.9 \\
 \bfseries Swin UNETR & \bfseries public $\rightarrow$ vertebrae &  \bfseries 71.3 & \bfseries 91.6 & \bfseries 89.0 & \bfseries 76.0 \\
 \hline
 3D MAE & - & 35.1 & 58.8 & 70.3 & 30.2 \\
 \bfseries 3D MAE & \bfseries vertebrae & \bfseries 75.2 & \bfseries 92.9 & \bfseries 92.5 & \bfseries 82.0 \\
 \hline
Video MAE & - & 33.3 & 84.5 & 65.5 & 23.7 \\
 Video MAE & 650,000 video clips\tablefootnote{Video MAE pretraining dataset: Kinetics-700 \citep{carreira2019short}} (public) & 80.1 & 93.9 & 96.5 & 89.1 \\
 \bfseries Video MAE & \bfseries public $\rightarrow$ vertebrae & \bfseries 84.1 & \bfseries 95.3 & \bfseries 96.8 & \bfseries 90.4 \\
 \hline
 Video-CT MAE & - & 34.8 & 68.1 & 67.6 & 28.3 \\
 Video-CT MAE & 650,000 video clips (public) & 85.7 & 95.5 & 98.0 & 92.5 \\
 \cellcolor{lightgray} \bfseries Video-CT MAE (ours) & \cellcolor{lightgray} \bfseries public $\rightarrow$ vertebrae & \cellcolor{lightgray} \bfseries 88.4 & \cellcolor{lightgray} \bfseries 96.4 & \cellcolor{lightgray} \bfseries 98.2 & \cellcolor{lightgray} \bfseries 93.2 \\
 \hline
\end{tabular}}
\caption{Comparison with other self-supervised pretraining methods}
\label{tab:ssl_method_comparison}
\end{table}

Models that combine public and task-specific pretraining outperform those with public pretraining only or no pretraining on nearly all metrics; the sole exception is the AUC of Models Genesis (97.8\% vs.\ 98.1\%). The benefit is particularly evident for the transformer-based models (ViT UNETR, Swin UNETR, MAE, and our Video-CT MAE), which stay below 37\% F1 without pretraining, underscoring the critical role of pretraining in these architectures. Models Genesis performs strongly even without pretraining (85.8\% F1), suggesting that CNN-based models may be less reliant on extensive pretraining for smaller datasets. Our Video-CT MAE method achieves the best result on every metric.


\subsection{Vertebra Fracture Detection}
We evaluate our method for vertebral fracture detection by comparing it with an existing technique and common 3D classification architectures in medical settings. In addition, we show the challenges of training 3D ViTs from scratch, which inspired our proposed approach.

% Vertebra Fracture Detection Overview
\begin{table}[ht]
\centering
\small
\begin{tabular}{|c|c|c|c|c|}
 \hline
 \bfseries Method & \bfseries F1 (\%) & \bfseries ACC (\%) & \bfseries AUC (\%) & \bfseries AP (\%) \\ 
 \hline
 \hline
 
 DenseNet121 \citep{huang2017densely} & 73.7 & 92.1 & 92.1 & 81.1 \\
 DenseNet169 \citep{huang2017densely} & 72.9 & 91.7 & 93.3 & 83.9 \\
 \hline
 ResNet18 \citep{he2016deep} & 81.8 & 94.6 & 95.4 & 89.6 \\
 ResNet50 \citep{he2016deep} & 79.5 & 93.7 & 94.6 & 85.0 \\
 \hline
 ViT-B \citep{dosovitskiy2020image} & 32.2 & 61.2 & 63.0 & 19.7 \\
 ViT-L \citep{dosovitskiy2020image} & 33.5 & 55.0 & 64.7 & 22.0 \\
 \hline
 \citet{engstler2022interpretable} & 85.1 & 95.4 & 96.2 & 89.1 \\
 \hline
 \bfseries Video-CT MAE (ours) & \bfseries 88.4 & \bfseries 96.4 & \bfseries 98.2 & \bfseries 93.2 \\
 \hline
\end{tabular}
\caption{Comparison with state-of-the-art vertebra fracture detection}
\label{tab:fracture_detection_comparison}
\end{table}

Our approach clearly outperforms conventional classification architectures and surpasses the vertebral fracture classification method of \citet{engstler2022interpretable} on all four metrics. It thereby successfully applies 3D ViTs in a challenging 3D medical setting characterized by scarce and imbalanced labeled data.