\section{Introduction}
Spinal health is a critical aspect of overall well-being and quality of life. Early and accurate diagnosis of vertebral body anomalies is essential for appropriately treating spinal disorders. Osteoporotic fractures, for instance, affect up to 12\% of men and women aged 50--79 years across Europe \citep{harvey2010osteoporosis}. Computed tomography (CT) has become an indispensable tool for diagnosing vertebral fractures. However, manual interpretation of CT scans can be time-consuming and subjective, potentially leading to errors and delays in diagnosis and treatment \citep{carberry2013unreported}. Deep learning has already demonstrated promising results in automating the detection of vertebral fractures \citep{husseini2020grading, engstler2022interpretable, keicher2023semantic}. However, these studies have also highlighted the need for interpretable methods, as understanding the decision-making process of the models is crucial for building trust and ensuring reliable clinical application.
Vision Transformers (ViTs) \citep{dosovitskiy2020image} are promising candidates for interpretable fracture detection because their attention maps can be visualized directly. However, their application has been primarily limited to 2D medical images \citep{chlkad2023deep}, as they are data-hungry and often rely on initialization from models pretrained on large-scale 2D image datasets like ImageNet~\citep{deng2009imagenet}. This limits their effectiveness in tasks involving volumetric data, such as vertebral fracture detection in CT images. A potential solution is to use models pretrained on videos, which are also three-dimensional data, with two spatial dimensions and one temporal dimension \citep{ke2023video}.
Video-pretrained models offer a promising solution for initializing ViTs for 3D medical image analysis, but the domain shift between videos and medical images is substantial. An alternative approach is to use self-supervised pretraining of ViTs with in-domain data, which has been shown to improve anatomical image understanding and enhance downstream task performance \citep{tang2022self}. Surprisingly, we find that published self-supervised pretrained ViT models significantly underperform CNN models on our task, while the CNN models perform similarly whether they are randomly initialized or pretrained on CT patches using Models Genesis \citep{zhou2021models}.
While many publicly available CT datasets contain spine images, vertebral fractures are rare, and the few datasets that include fracture annotations are highly class-imbalanced.
We argue that self-supervised pretraining on a task-specific unlabeled dataset, even one that contains mainly healthy vertebrae, can help the model understand the anatomy and improve its performance in detecting pathologies.
Therefore, we curate a task-specific pretraining dataset and propose a novel approach that combines the benefits of transfer learning and self-supervised pretraining for vertebral fracture detection in CT scans. Our main contributions are:

\begin{itemize}
\item We propose a framework that allows Vision Transformers to effectively detect vertebral fractures in 3D CT images in a low-data regime, outperforming CNN-based methods while providing inherent interpretability through attention visualizations.
\item We introduce a self-supervised domain adaptation method and a new task-specific pretraining dataset to bridge the gap between video-pretrained models and medical images, enabling the learning of relevant anatomical features in the target domain.
\item Our experimental evaluation demonstrates that the proposed task-specific pretraining improves downstream task performance for both existing pretraining methods and our novel adaptation of video-based transfer learning.
\end{itemize}