\section{Related Work}
\paragraph{Vertebral Fracture Detection}
Deep learning-based vertebral body classification methods can be divided into 2D and 3D approaches. 2D methods analyze a single sagittal slice from the 3D vertebra volume \citep{husseini2020grading}; this is computationally efficient and allows initialization with ImageNet weights, but fails to capture important 3D structural features. More advanced 2.5D approaches aggregate information from multiple 2D slices using RNNs \citep{bar2017compression} or LSTM networks \citep{tomita2018deep}, but still do not fully exploit the 3D spatial context of CT data. Following the first application of 3D convolutions to vertebral fracture detection by \citet{nicolaes2020detection}, recent studies predominantly use 3D models to leverage the full volumetric data \citep{engstler2022interpretable, keicher2023semantic, chettrit20203d}. However, these models cannot use pretrained ImageNet weights and must be trained from scratch, which is challenging given the limited availability of labeled medical data.

Recent studies increasingly apply transformer-based techniques to vertebra classification. \citet{chlkad2023deep} explored the effectiveness of transformers in identifying cervical spine fractures from single 2D slices. Similarly, \citet{windsor2022context} employed a hybrid approach that combines a CNN for 2D feature extraction from sagittal slices with transformers for feature aggregation within and across scans. However, because they rely on 2D inputs, these methods are limited in their ability to capture 3D structural features.

\paragraph{Self-supervised Learning}
In recent years, self-supervised learning has emerged as a popular approach for pretraining deep learning models. A recent trend in computer vision is transformer-based masked image modeling, inspired by the success of masked language modeling in NLP as demonstrated by BERT \citep{devlin2018bert}. Masked image modeling is already well established for 2D images \citep{he2022masked}, videos \citep{feichtenhofer2022masked}, and multimodal models \citep{girdhar2023omnimae}. More recent methods such as MSN \citep{assran2022masked} and I-JEPA \citep{assran2023self} improve efficiency by using joint-embedding architectures, which, unlike these reconstruction-based approaches, avoid pixel-level reconstruction.

Most well-known self-supervised learning methods were initially designed for natural images, but there have also been significant advances in the medical domain. One noteworthy contribution is Models Genesis \citep{zhou2021models}, which pretrains a CNN-based model. Two other prominent methods, both transformer-based, combine multiple pretext tasks such as image restoration, contrastive learning, and image rotation prediction \citep{tang2022self}.

Current self-supervised pretraining methods face two key challenges: pretraining on non-medical data offers less domain specificity for tasks such as CT-based vertebral fracture detection, while medical-specific self-supervised learning is limited by datasets that are smaller than those available in the natural image and video domains.

\paragraph{Video Pretraining for CT Analysis}
To address deep learning's reliance on large labeled datasets such as ImageNet \citep{deng2009imagenet} in medical applications, recent research has highlighted the benefits of pretraining 3D medical models on extensive video datasets \citep{zunair2021viptt}. \citet{ke2023video} and \citet{rajpurkar2020appendixnet} both found that pretraining 3D medical models on large-scale, out-of-domain video datasets yields better performance than training from scratch or pretraining on conventional in-domain CT datasets. However, because these models are pretrained in a different domain, they encounter drawbacks when applied to medical CT data without appropriate domain adaptation.
