\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{xcolor}
\usepackage{soul}
\usepackage{graphicx}
\usepackage{rotating}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{comment}
\usepackage[table,xcdraw]{xcolor}
\usepackage{multirow}
\usepackage{float}
\usepackage{placeins}
% \usepackage{floatrow}
\usepackage{mwe} % to get dummy images
\jmlrvolume{-- 171}
\jmlryear{2025}
\jmlrworkshop{Full Paper -- MIDL 2025}
\editors{Accepted for publication at MIDL 2025}

\title[Learning from a Few Shots]{ Learning from a Few Shots: \\ Data-efficient Cervical Vertebral Maturation Assessment}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Helen Schneider\midljointauthortext{Contributed equally}\nametag{$^{1}$}} 
\Email{helen.schneider@iais.fraunhofer.de} \\
\Name{Aditya Parikh\midlotherjointauthor\nametag{$^{1}$}} \Email{aditya.parikh@iais.fraunhofer.de}\\
\Name{Priya Tomar\nametag{$^{1,2}$}} \Email{priya.priya@iais.fraunhofer.de} \\
\Name{Maximilian Broß\nametag{$^{1}$}} \Email{maximilian.bross@iais.fraunhofer.de} \\
\Name{Tom Verhofstadt\nametag{$^{4}$}} \Email{tom.verhofstadt@mil.be} \\
\Name{Anna Konnerman\nametag{$^{3}$}} \Email{konermann@uni-bonn.de} \\
\Name{Rafet Sifa\nametag{$^{1,2}$}} \Email{rafet.sifa@iais.fraunhofer.de}\\
\addr $^{1}$ Fraunhofer IAIS \\
\addr $^{2}$ University of Bonn \\
\addr $^{3}$ University Hospital Bonn \\
\addr $^{4}$ Medical Service, Belgian Army
}


\begin{document}

\maketitle

\begin{abstract}
The timing of treatment is a crucial decision in orthodontics. Initiating treatment during the appropriate growth phase leads to optimal patient outcomes and can prevent prolonged treatment durations. The most commonly used method for classifying growth phases is cervical vertebral maturation (CVM) assessment, which distinguishes six stages based on the shape and size of the cervical vertebrae. Because manual CVM analysis is complex, visual assessment often falls short in accuracy. Deep learning methods can assist physicians in classifying CVM stages, thus improving orthodontic workflows and treatments. However, a significant challenge in deep learning-based CVM assessment is the limited dataset volume, resulting from difficulties in data collection and annotation. While small training datasets can greatly hinder a model's generalization performance, research on data-efficient training methods for CVM assessment is still lacking. To the best of our knowledge, this paper is the first to evaluate the potential of few-shot learning and in-domain transfer learning for CVM assessment. Specifically, we investigate the architectures ResNet18 and SAM-Med2D. Few-shot learning improves classification accuracy by up to $9$ percentage points. Additionally, in-domain pre-training (using chest X-ray data) yields a significant accuracy increase of up to $4$ percentage points.
\end{abstract}

\begin{keywords}
Few-shot Learning, Transfer Learning, SAM-Med2D, Orthodontics, CVM Assessment
\end{keywords}

\section{Introduction}
In addition to selecting the appropriate orthodontic treatment, the timing of treatment is crucial for achieving successful outcomes \cite{liao2022icvm, mohammad2022deep}. The stage of facial growth impacts diagnosis, prognosis, treatment planning, and results. Correctly classifying the growth phase and initiating treatment during the optimal growth period can lead to the best patient outcomes. Conversely, an incorrect classification may result in prolonged treatment durations or the need for surgical interventions to correct jaw deformities \cite{mohammad2022deep}.
Several indicators of skeletal maturation have been proposed to assist in determining treatment timing, including dental development and eruption times, hand and wrist maturation, and cervical vertebral maturation (CVM) and morphology. While hand and wrist maturation assessments rely on hand radiographs, CVM utilizes lateral cephalograms, which are commonly used in orthodontic diagnostic procedures \cite{mohammad2022deep}. Consequently, CVM assessment is the method most frequently employed by orthodontists: because it relies on lateral cephalograms that are already acquired in routine diagnostics, it avoids an additional radiograph and thus reduces the radiation dose per patient.
CVM is classified into six cervical stages (CS1--CS6) based on the size and shape of the cervical vertebrae \cite{liao2022icvm}. Previous studies have shown that this classification is highly reliable \cite{gu2007mandibular, malta2009quantification}. However, differentiating between these stages in a clinical context can be quite challenging. Variations in clinicians' understanding of skeletal morphology lead to unavoidable subjectivity, which complicates accurate assessment. Consequently, CVM evaluation requires considerable clinical experience and has not achieved satisfactory accuracy when assessed visually \cite{liao2022icvm, chatzigianni2009geometric}.

%However, the CVM degree methods have shown suboptimal intraobserver agreement, highlighting difficulties in accurate CVM assessment. 
In recent years, computer vision and deep learning (DL) algorithms have demonstrated remarkable performance in the analysis of medical image data \cite{schneider2023one, schneider2024informed}. This potential is also significant in the field of dentistry, where DL-based diagnostic assistance can greatly enhance clinical practice. Studies such as \cite{tao2022tooth} and \cite{hwang2019overview} have explored various orthodontic applications, including DL-based proximal caries detection. However, CVM assessment presents a particularly challenging use case for two key reasons.
On the one hand, ambiguous boundaries between neighboring stages and subjectivity in label annotation can lead to noisy or uncertain labels, which may weaken the classification performance of the DL model \cite{liao2022icvm}.
On the other hand, the limited volume of the dataset, resulting from the difficulties associated with data collection, poses a significant challenge for CVM assessment \cite{liao2022icvm, seo2021comparison}. DL methods typically require extensive datasets for supervised training to achieve the performance necessary for diagnostic decision support. Limited training data can lead to poor generalization of a DL method, hindering its application in clinical workflows.
Due to these challenges, data-efficient DL methods are of great interest to researchers, companies, and clinics aiming to develop diagnostic decision support systems for CVM assessment.


\section{Related Work}


 %% Transfer learning needs to be included as well 
 
%CVM and ML
To address the challenges faced by physicians, various studies have explored traditional machine learning methods for CVM assessment \cite{kok2019usage, amasya2020cervical, amasya2020validation}. These methods rely on handcrafted features; the best performance was achieved with a feedforward neural network. Nevertheless, the application of these traditional machine learning methods in real-world clinical scenarios is constrained by their limited accuracy.\\

In recent years, DL and computer vision have achieved tremendous success in analyzing medical image data \cite{schneider2023symmetry, schneider2023segmentation}.
In orthodontics, DL algorithms have yielded strong performance, among others for the classification of dental abnormalities such as proximal caries \cite{hwang2019overview, lin2022detecting}. Additionally, several studies have explored CVM assessment using DL algorithms, more precisely convolutional neural networks (CNNs), obtaining reasonable accuracy \cite{hwang2019overview, kim2021estimating, seo2021comparison, zhou2021development, mohammad2022deep}.\\
However, these studies underline a significant challenge of state-of-the-art CVM assessment: they rely on limited training datasets, often with fewer than 1,000 samples, due to the high cost and time demands of generating extensively annotated data as well as restricted access to public CVM assessment datasets. These data limitations hinder the performance of supervised DL-based CVM classification.\\
Despite these severe consequences, research on data-efficient DL methods for CVM assessment is still insufficient. State-of-the-art studies focus solely on transfer learning based on ImageNet-pretrained weights. However, because the image modalities differ, ImageNet-based transfer learning has not achieved a significant performance improvement \cite{makaremi2019deep}. \\

Few-shot learning (FSL) offers a promising approach for data-efficient training \cite{wang2020generalizing,parnami2022learning}. While remarkable successes have been achieved in various application areas, FSL remains underexplored in the field of dentistry. The authors of \cite{kim2024unsupervised} examined the potential of unsupervised few-shot learning for the diagnosis of periodontal disease, highlighting the capability of FSL to address data limitations. However, to the best of our knowledge, FSL has not yet been evaluated for CVM assessment. The aim of our paper is to fill this gap and enable data-efficient CVM classification. Specifically, our main contributions are:
\begin{itemize}
    \item First evaluation of FSL for CVM assessment
    \item Initial examination of in-domain transfer learning (e.g., utilizing medical images) for multi-class and FSL CVM assessment
    \item First investigation of the foundation model SAM-Med2D for CVM classification, compared to CNNs, for multi-class and FSL training

\end{itemize}

\begin{comment}
To address the challenges faced by physicians, various studies have explored the use of traditional machine learning methods for CVM assessment \cite{kok2019usage, amasya2020cervical, amasya2020validation}. Amasye et al. \cite{amasya2020cervical} examined the potential of several classification methods, such as support vector machines or decision tree models, for the automatic assessment of CVM. The methods rely on handcrafted features, the best performance was achieved with a feed forward neural network. Nevertheless, the application of these traditional machine learning methods in real-world clinical scenarios is constrained by their limited accuracy.\\

In the recent years DL and computer vision achieved tremendous success in analyzing medical image data. For instance, DL models have demonstrated the ability to detect lung diseases in chest X-ray scans \cite{schneider2023one, schneider2024informed, schneider2023symmetry} and assess lumbar MRI scans for abnormalities \cite{schneider2023segmentation}. 
In the fields of dentistry DL algorithms yielded strong performance among others for abnormality detection in teeths, such as proximal caries detection \cite{hwang2019overview, lin2022detecting}. Additionaly, several studies explored the assessment of CVM using DL algorithms \cite{hwang2019overview, kim2021estimating, seo2021comparison, zhou2021development, mohammad2022deep}. In \cite{seo2021comparison, mohammad2022deep} the authors investigated the potential of several state-of-the-art convolutional neural networks (CNN) for the CVM classification, achieving reasonable accuracy.\\
However, these studies underline a significant challenge of state-of-the-art CVM assessment: they rely on limited training datasets of fewer than 1,000 samples due to the high costs and time demands of generating extensively annotated data, as well as restricted access to public CVM assessment datasets. These data limitations hinder the performance of supervised DL-based CVM classification.\\
Despite the severe consequences, research on data-efficient DL methods for cervical vertebral maturation (CVM) assessment is still insufficient. State-of-the-art studies solely focused on transfer learning based on ImageNet-pretrained weights. However, due to the different varying image modalities, no significant performance improvement has been achieved with ImageNet-based transfer learning \cite{makaremi2019deep}. \\

Few-shot learning (FSL) offers a promising approach for data-efficient training \cite{wang2020generalizing,parnami2022learning}. While remarkable successes have been achieved in various application areas, FSL remains underexplored in the field of dentistry. The authors of \cite{kim2024unsupervised}  examined the potential of unsupervised few shot learning for the diagnosis of periodontal disease, highlighting the capability of FSL to address data limitations. However, to the best of our knowledge, FSL has not yet been evaluated for CVM assessment. The aim of our paper is to fill this gap and enable data-efficient CVM classification. Specifically, to the best of our knowledge, our main contributions are: 
\begin{itemize}
    \item First evaluation of FSL for CVM assessment
    \item Initial examination of in-domain (e.g. medical images) transfer learning for multiclass and FSL CVM assessment 
    \item First investigation of state-of-the-art Vision Transformer models compared to CNNs for multiclass and FSL CVM classification
\end{itemize}
\end{comment}



 % CVM in general, what is achieved, what are shortcomings
% --> two mayor problems data set to small and noisy labels

% Data efficient solution: FSL, research to FSL in medical imaging and general
\section{Materials and Methods}

 % Data Set 
 % ResNet and MedSam
 % FSL ( BCE and COntrastive loss)
 % Experiments

\subsection{Dataset}

 %The images had a resolution of 640x1280 pixels. For our experiments, we used a subset named CVM-900-Subset introduced in XYZ, which contains 508 samples with clear CVM stages, eliminating label ambiguity for images that lie on the boundary between two stages.
%The CVM-900-Subset was divided into training and test sets with an 80-20 ratio using stratified splitting to maintain the label distribution. We resized the images to $256 \times 256$ pixels, as the implemented model architecture MedSam-2D only supports this resolution. For data augmentation, we applied random horizontal flipping, color jittering, and random rotation within ±30 degrees.

We use the public dataset CVM-900 provided by \cite{liao2022icvm}. This dataset includes 900 clear and distinguishable lateral cephalograms of orthodontic patients aged 7--25 years. The annotation process was conducted through a rigorous multi-expert assessment protocol, in which three specialists in orthodontics and radiology independently evaluated all images according to the CVM method developed by \cite{Lamparski1975SkeletalAA}, \cite{BACCETTI2005119}, and \cite{10.2319/111517-787.1}. Each expert classified the images into one of six CVM stages based on the morphological characteristics of the second (C2), third (C3), and fourth (C4) cervical vertebrae. Images with unanimous classification were directly included, while those with discrepancies underwent a consensus review session in which the experts collectively determined the final classification.

The images were cropped to ensure the cervical vertebrae appear in a fixed position, reducing interference from other anatomical structures. The images have a resolution of $640\times 1280$ pixels. For our experiments, we use a subset named CVM-900-Subset introduced in \cite{liao2022icvm}, which contains 508 samples with clear CVM stages, removing the label ambiguity of images that lie on the boundary between two stages.
The CVM-900-Subset is divided into training and test sets with an 80-20 ratio using stratified splitting to maintain the label distribution. We resize the images to $256 \times 256$ pixels to maintain compatibility with the implemented SAM-Med2D architecture. For data augmentation, we apply random horizontal flipping, color jittering, and random rotation within $\pm 30$ degrees.

\begin{comment}
\subsection{Training Methods}
We implement two training methods, traditional supervised multi-class (MC) and FSL training. \\
The MC method involves training a model to classify input data into one of several predefined classes. The model learns to distinguish between them based on features present in the data. One of the most common loss function for MC training is the Cross-Entropy (CE) loss. This method often relies on extensive annotated data to achieve reasonable generalization performance.\\

FSL training enables the model to classify classes with only a small number of labeled examples.  In this paper we implement a N-way-K-shot FSL training, where N represents the number of classes and K denotes the number of examples (or "shots") provided for each class. The training involves two datasets, the support and query set. The first includes K labels training samples for each of the N classes. The second contains new examples for each of the N classes. Utilizing the embeddings from the support set, the model predicts the classification for each example in the query set.
A loss function evaluates the divergence between the model's predictions and the ground truth. We implement to common FSL loss function, the Binary Cross-Entropy and Supervised Contrastive loss. While both loss functions leverage similarity labels between the query samples and  the support data, the contrastive loss additionally clusters examples of the same class together and separates those from different classes in the feature space. For a more detailed introduction in FSL training and the implemented loss functions please refer to \cite{snell2017prototypical,khosla2020supervised}.

\end{comment}

\subsection{Model Architecture}
We utilize two distinct architectures for the CVM assessment. \\
 
\textbf{Modified ResNet18}
The network is built on the widely used CNN ResNet \cite{he2016deep}, known for its strong performance in medical imaging tasks. Inspired by \cite{liao2022icvm}, who demonstrated the effectiveness of additional convolutional layers and dropout for CVM assessment, we modify the ResNet18 architecture. We add three convolutional layers after the backbone, each with batch normalization and Leaky ReLU activation ($\alpha = 0.1$), keeping the feature dimension at 512. While \cite{liao2022icvm} used dropout probabilities of 0.5, we found that a single dropout layer with $p = 0.3$ before the global average pooling and the final linear classification layer was sufficient for our task. Both the dropout layer and batch normalization are intended to mitigate overfitting and enhance training stability. Despite the additional convolutional layers, we refer to this architecture as ResNet18 in the following for brevity.
We initialize the ResNet backbone with ImageNet-pretrained weights and use Kaiming initialization for all other weights. Additionally, we evaluate the impact of more effective transfer learning for data-efficient training. To this end, we pre-train on medical image data, specifically chest X-ray images from the MedicalMNIST dataset \cite{yang2023medmnist}. In the following sections, we refer to this weight initialization as in-domain transfer learning and to the resulting model as Med-ResNet18. \\

% Due to weak training performance of the ResNe18 architecture, we added three additional convolutional layers after the backbone, incorporating batch normalization and Leaky ReLU activation (alpha=0.1). We reduced the feature dimension from xy to 512. A dropout layer (p=0.3) is included before the global average pooling and the final linear classification layer

\textbf{SAM-Med2D Encoder}
The second architecture leverages the image encoder from SAM-Med2D \cite{cheng2023sammed2d}, a medical domain adaptation of the Segment Anything Model (SAM) \cite{kirillov2023segany}. SAM-Med2D was pre-trained on an extensive medical image dataset encompassing over 4.6 million images across various modalities. We utilize the Vision Transformer (ViT)-base variant of SAM-Med2D as our backbone encoder. This choice allows us to capitalize on its pre-trained weights, which are specifically tuned for medical imaging tasks, and to leverage the powerful transformer architecture. The classification head has the same structure as in the ResNet architecture; only the number of input channels of its first layer is set to 256 to match the encoder's output. Please note that adapting powerful segmentation foundation models, such as SAM-Med2D, to classification tasks is an underexplored research area. This work therefore offers further insights into the adaptation of segmentation foundation models for orthodontic use cases.\\
Both models output logits corresponding to the six CVM stages.

% @Aditya what grid searches where implemnted, how did we find the final hyperparamters, what are the final parameters, which packages and versions did we use for the models and pytorch, torchmetrics etc. everything we would ne to reproduce the paper
\subsection{Experiments}

Our experiments are conducted using the PyTorch framework on an NVIDIA A100 GPU with 40\,GB VRAM. To ensure reproducibility, all experiments are executed with three fixed random seeds.

For the ResNet-based models, we employ a batch size of 32 and train for 100 epochs with early stopping to prevent overfitting, using a learning rate of $10^{-3}$. For the medically pre-trained architectures (SAM-Med2D and Med-ResNet18), we apply a learning rate of $10^{-4}$ to the encoder and $10^{-3}$ to the classification head to further mitigate overfitting. We additionally explored training with both frozen and unfrozen encoders, ultimately selecting the unfrozen configuration because it performed better. A learning rate scheduler reduces the rate by a factor of $0.1$ after 20 epochs. Due to the limited data volume, we initialize the models with pre-trained ImageNet weights and use the Cross-Entropy (CE) loss function for traditional multi-class (MC) training. We include the training results of a randomly initialized model as a baseline in our analysis.

For few-shot learning (FSL) training, we perform an extensive hyperparameter search, evaluating $k$-shot values of $\{1, 3, 5, 10, 20\}$ and query sizes of $\{5, 10, 20\}$, along with learning rates of $\{10^{-3}, 10^{-4}\}$. Training consists of 20 epochs, each with 100 randomly sampled tasks, and model performance is assessed on 50 fixed validation tasks after each epoch. Based on empirical results and computational efficiency, we choose a query size of 10 for the final experiments. The evaluation focuses on $k$-shot values of 1, 3, and 5, reflecting realistic clinical scenarios in which labeled examples are scarce. We employ both Binary Cross-Entropy (BCE) and Supervised Contrastive Loss (SCL) to analyze the impact of the loss function. The best models are selected based on the lowest loss on the test set. We evaluate performance in terms of accuracy, relaxed accuracy, and mean absolute error (MAE). The relaxed accuracy counts a prediction as incorrect only when it deviates from the annotated stage by more than one class. Because class boundaries in CVM assessment are ambiguous, relaxed accuracy and MAE are well suited to measuring model performance.
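
For clarity, the three evaluation scores can be written out explicitly. The following is a sketch under the assumption that the stages CS1 to CS6 are encoded as the integers $1,\dots,6$; the symbols $y_i$, $\hat{y}_i$, and $n$ are introduced here for illustration only. With $y_i$ the annotated stage and $\hat{y}_i$ the predicted stage of test image $i$, and $n$ the number of test images,
\begin{align*}
\mathrm{Acc} &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[\hat{y}_i = y_i\right], \\
\mathrm{Acc}_{\mathrm{relaxed}} &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[\lvert \hat{y}_i - y_i \rvert \le 1\right], \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert .
\end{align*}
Under these definitions, $\mathrm{Acc} \le \mathrm{Acc}_{\mathrm{relaxed}}$ always holds, and a low MAE indicates that misclassified images are assigned to stages close to the annotated one.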

\begin{comment}
For FSL training, we conduct an extensive hyperparameter search across different configurations. We evaluated k-shot values of $\{1, 3, 5, 10, 20\}$ and query sizes of $\{5, 10, 20\}$, while testing learning rates of $\{1e-3, 1e-4\}$. The FSL training regime consisted of 20 epochs where each epoch consisted of 100 randomly sampled tasks and model performance was evaluated using 50 fixed validation tasks after each epoch. Based on empirical results and computational efficiency considerations, we selected a query size of 10 for our final experiments. For the final evaluation and results analysis, we focused on k-shot values of 1, 3, and 5, as these represent realistic scenarios in clinical settings where obtaining labeled examples is resource-intensive and time-consuming. We conduct both Binary Cross-Entropy (BCE) and Supervised Contrastive Loss (SCL) experiment, to investigate the influence of the implemented FSL loss function.\\
The best model is selected based on the lowest loss on the test set.
\end{comment}

To visualize and interpret the model's focus during prediction, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) using the PyTorch implementation \cite{jacobgilpytorchcam}. We apply Grad-CAM to the last convolutional layer (\texttt{layer4}) of our modified ResNet18 to generate attention maps highlighting the regions that influence the classification decision.


\section{Results}
Table~\ref{tab:model_comparison} shows the highest performance scores on the independent test dataset for the different training methods and architectures. A comprehensive table, considering the $k$-shot values $\{1, 3, 5\}$, is included in the Appendix (Table~\ref{table:appendix}).
  The baseline ResNet18 experiments achieve a relatively low accuracy of $53\%$ for traditional MC training. However, the relaxed accuracy of $88\%$ shows that incorrect predictions typically deviate by only one class. This is a recognized issue in CVM assessment, caused by the ambiguous boundaries between neighboring stages, and indicates that the model is able to extract the most relevant features. Figure~\ref{fig:tsne} illustrates the t-SNE visualization of high-level features from our ResNet18 model with MC training, showing distinct clusters for the six CVM stages with some overlap. Stages 1 and 6 (the earliest and latest maturation phases) are the most distinct, while intermediate stages display gradual transitions, reflecting vertebral development. This emphasizes the challenge of ambiguous boundaries and the importance of future research on managing ambiguous labels for CVM assessment.

\begin{figure}[H]
    \centering
    \includegraphics[width=0.65\textwidth]{plots/tSNE.png} 
    \caption{t-SNE visualization of the high-level features of images in CVM-900-Subset. The rectangles in different colors represent samples of different CVM stages.}
    \label{fig:tsne}
\end{figure}
  
  Additionally, we observed a notable decrease in performance to $47\%$ accuracy when training the ResNet architecture from scratch. This highlights the advantages of transfer learning, even from out-of-domain data. Further results for MC and FSL training with random weight initialization can be found in the Appendix (Table~\ref{table:appendix}). In the following, we organize our analysis of the results according to the data-efficient methods used.\\
  
\textbf{Transfer Learning}
In-domain transfer learning significantly enhances the performance of the ResNet18 architecture by up to $4$ percentage points in accuracy for MC training.
However, the relaxed accuracy increases by only $1$ percentage point, which is not a significant improvement. This suggests that in-domain transfer learning mainly corrects predictions that deviated by just one class in the ImageNet-pretrained ResNet18 experiments.
The more complex ViT architecture of the SAM-Med2D network leads to lower performance in MC training. Although the model was pre-trained on a substantial medical image dataset, we observed a strong tendency to overfit. Adapting the SAM-Med2D model so that its strong in-domain pre-training benefits data-efficient CVM assessment is left for future work. \\

\textbf{Few-shot Learning}
For the ResNet18 experiments, FSL significantly outperforms MC training: accuracy increases by up to $9$ percentage points.
The relaxed accuracy of $94\%$ indicates that misclassified samples rarely deviate by more than one class. This improvement and the MAE reduction of $0.13$ further emphasize the improved classification behavior of the model. All three metrics improve significantly under FSL training. These results highlight the advantages of FSL for CVM assessment, enabling data-efficient training for orthodontic use cases with the ResNet18 architecture. \\
Despite the significant performance improvements achieved through FSL with the ImageNet-pretrained ResNet18, these results do not carry over to the Med-ResNet18 and SAM-Med2D models. The powerful ViT architecture and/or the use of medical pre-trained weights lead to overfitting on the limited training data, so FSL yields no substantial performance gain. These findings are further visualized in Figure~\ref{fig:scl_accuracy_plot}, which shows the accuracy scores for different $k$-shot values for all three architectures. \\
% no significant test ???
Figure \ref{fig:loss_plot} highlights that while SCL outperforms BCE training for all three architectures, only a minimal performance increase is observed. This suggests that the model architecture plays a more crucial role in FSL training than the choice of loss function. \\
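
As a quick check, the gains quoted above can be recomputed from the mean values reported in Table~\ref{tab:model_comparison}. The differences below are given in percentage points for the accuracies and in stages for the MAE; they compare means over the random seeds only and do not account for the reported standard deviations.
\begin{align*}
\text{FSL (SCL) vs.\ CE, ResNet18, accuracy:} \quad & 62.46 - 53.40 = 9.06, \\
\text{FSL (BCE) vs.\ CE, ResNet18, relaxed accuracy:} \quad & 94.17 - 88.03 = 6.14, \\
\text{FSL (BCE) vs.\ CE, ResNet18, MAE:} \quad & 0.59 - 0.46 = 0.13, \\
\text{Med-ResNet18 vs.\ ResNet18, CE, accuracy:} \quad & 57.93 - 53.40 = 4.53.
\end{align*}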

\begin{table}[H]
\caption{
Overview of the best ResNet18, Med-ResNet18 and SAM-Med2D experiments based on the test set. FSL training achieves the highest score for ResNet18, surpassing both MC classification and FSL for the other architectures. Significant differences between CE training and the FSL methods within one architecture are marked with an asterisk (\textbf{*}). The dagger (\textbf{†}) marks a significant improvement due to in-domain transfer learning.}
\label{tab:model_comparison}
\centering
\begin{tabular}{llccc}
\toprule
\textbf{Model} & \textbf{Loss} & \textbf{Acc. (\%)} $\uparrow$ & \textbf{Relaxed Acc. (\%)} $\uparrow$ & \textbf{MAE} $\downarrow$ \\
\midrule
\multirow{3}{*}{ResNet18} 
 & CE & 53.40 $\pm$ 1.59 & 88.03 $\pm$ 0.51 & 0.59 $\pm$ 0.02 \\
 & BCE (FSL)  & 62.14 $\pm$ 2.10 \textbf{*} & \textbf{94.17 $\pm$ 2.38} \textbf{*} & \textbf{0.46 $\pm$ 0.06} \textbf{*}\\
 & SCL (FSL)  & \textbf{62.46 $\pm$ 4.37 \textbf{*}} & 91.59 $\pm$ 0.46 \textbf{*} & 0.49 $\pm$ 0.03 \textbf{*}\\
\midrule
\multirow{3}{*}{Med-ResNet18} 
 & CE & 57.93 $\pm$ 0.92 \textbf{†} & 89.32 $\pm$ 1.59 & 0.54 $\pm$ 0.03\textbf{†} \\
 & BCE (FSL)  & 56.31 $\pm$ 2.10 & 89.00 $\pm$ 0.92 & 0.57 $\pm$ 0.02 \\
 & SCL (FSL)  & 56.96 $\pm$ 3.57 & 91.26 $\pm$ 0.79 & 0.54 $\pm$ 0.05 \\
\midrule
\multirow{3}{*}{SAM-Med2D} 
 & CE  & 48.87 $\pm$ 1.65 & 87.06 $\pm$ 0.92 & 0.67 $\pm$ 0.02 \\
 & BCE (FSL) & 47.90 $\pm$ 2.78 & 85.44 $\pm$ 0.79 & 0.68 $\pm$ 0.03 \\
 & SCL (FSL) & 49.19 $\pm$ 5.16 & 86.41 $\pm$ 0.79 & 0.66 $\pm$ 0.06 \textbf{*} \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{plots/SCL_accuracy_plots.png} 
    % \caption{The accuracy of FSL training (red) and the MC baselines (grey) evaluated based on three different architectures: ResNet18 (left), Med-ResNet18 (middle), and SAM-Med 2D (right), with k-shot values $k \in \{1,3,5\}$. The MC performance is strongly influenced by the architecture used, with Med-ResNet18 achieving the highest MC performance, surpassing the ResNet18 architecture. This highlights the advantages of in-domain transfer learning. However, FSL training does not benefit from in-domain transfer learning, as the highest performance is observed with the ResNet18 architecture for two out of three k-shot values. The complex ViT architecture, SAM-Med 2D, shows the lowest performance for both MC and FSL training, regardless of the applied k value. No clear trends are observed for the chosen k-shot values. Overall FSL training outperformes indomain-transfer learning for data-efficient CVM assessment.}
    \caption{Accuracy of FSL training (red) and MC baselines (grey) across the ResNet18, Med-ResNet18, and SAM-Med 2D architectures, with k-shot values $k \in \{1,3,5\}$. MC performance peaks with Med-ResNet18, showing the benefit of in-domain transfer learning. However, FSL training achieves the highest accuracy with ResNet18 for most k-shot values, indicating no advantage from in-domain transfer learning for FSL. SAM-Med 2D shows the lowest performance across all settings. We observe no clear trend across k-shot values. Overall, FSL training surpasses in-domain transfer learning for data-efficient CVM assessment.}
    \label{fig:scl_accuracy_plot}
\end{figure}





\begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{plots/accuracy_plots.png} 
    \caption{Accuracy of FSL training with the SCL (orange) and BCE (blue) losses across the ResNet18, Med-ResNet18, and SAM-Med 2D architectures, with k-shot values $k \in \{1,3,5\}$. For each architecture, the highest accuracy is achieved with the SCL loss, which surpasses the BCE loss in seven out of nine experiments. However, the difference between the highest accuracy scores of the two loss functions is minimal for each architecture.}
    \label{fig:loss_plot}
\end{figure}
% \FloatBarrier 

% Grad cam 
In addition to the performance evaluation, we analyze the interpretability of the preferred FSL method. Figure~\ref{fig:grad_cam} visualizes five patient samples. For the correct prediction of CS6 (fourth sample), the heat map highlights the cervical vertebrae C2, C3, and C4 as highly relevant, indicating a well-informed model decision. We observe a similarly correct focus for the patient in CS4 (fifth sample). Conversely, for the incorrect predictions of CS1 (first and second samples), the model primarily focuses on the patient's jaw, resulting in an unreliable decision. For the incorrect prediction of the third sample, the model does not focus on the highly relevant cervical vertebrae C2, C3, and C4. Although we observed this behavior for several test samples, an extensive interpretability analysis is left for future work.
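
For reference, the heat maps can be read against the standard Grad-CAM formulation of Selvaraju et al. The following is a sketch that assumes this standard variant is the one behind Figure~\ref{fig:grad_cam}; the layer choice and the explained score are not specified here and are assumptions. Let $A^{k}$ denote the $k$-th feature map of the chosen convolutional layer, $Z$ the number of its spatial positions, and $y^{c}$ the scalar model score being explained. Then
\begin{align*}
\alpha_k^{c} &= \frac{1}{Z}\sum_{i}\sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}}, \\
L^{c}_{\text{Grad-CAM}} &= \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^{c}\, A^{k}\right).
\end{align*}
The ReLU keeps only regions with a positive influence on $y^{c}$, which is why red areas in the heat maps mark the regions that support the model's decision.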

% \begin{figure*}[t]
%    \begin{minipage}[t]{0.65\textwidth}
%        \includegraphics[width=\textwidth]{plots/two_example_comparison.pdf}
%    \end{minipage}
%    \hfill
%    \begin{minipage}[t]{0.3\textwidth}
%        \caption{Grad-CAM visualizations showing model attention: (top) a correct prediction with accurate focus on the target vertebrae, (bottom) an incorrect prediction with diffused and misplaced attention. Original images (left) and attention heatmaps (right) where red indicates regions of highest importance for the model's decision.}
%        \label{fig:gradcam}
%    \end{minipage}
% \end{figure*}
\begin{comment}
\begin{figure}[H]
\floatbox[{\capbeside\thisfloatsetup{capbesideposition={right,top},capbesidewidth=5cm}}]{figure}[\FBwidth]
{\caption{Grad-CAM visualizations showing model attention: (top) a correct prediction with accurate focus on the target vertebrae, (bottom) an incorrect prediction with diffused and misplaced attention.\\ Original images (left) and attention heatmaps (right) where red indicates regions of highest importance for the model's decision.}\label{fig:grad_cam}}
{\includegraphics[width=8cm]{plots/two_example_comparison_new.pdf}}
\end{figure}
\end{comment}



\begin{comment}
Figure \ref{fig:tsne} illustrates the t-SNE visualization of high-level features from our ResNet18 model and MC training, showing distinct clusters for the six CVM stages with some overlap. Stages 1 and 6 (earliest and latest maturation phases) are the most distinct, while intermediate stages display gradual transitions, reflecting vertebral development. This highlights the challenge of ambiguous boundaries, leading to misclassifications between adjacent stages. The proximity of images from successive stages indicates that the model captures relevant vertebral maturation patterns effectively.

\begin{figure}[H]
    \centering
    \includegraphics[width=0.9\textwidth]{plots/tSNE.png} 
    \caption{t-SNE visualization of the high-level features of images in CVM-900-Subset. The rectangles in different colors represent samples of different CVM stages.}
    \label{fig:tsne}
\end{figure}
\end{comment}

\section{Conclusion}
Our analysis of data-efficient training methods and architectures shows that in-domain transfer learning improves the performance of the ResNet18 model, but its impact is limited, especially regarding relaxed accuracy. FSL training significantly surpasses both MC training and transfer learning for ResNet18, highlighting its potential as a data-efficient strategy for CVM assessment. However, despite being pre-trained on a large medical image dataset, the state-of-the-art SAM-Med 2D network did not facilitate data-efficient training for either MC or FSL. Future work will address the challenges posed by limited datasets and ambiguous label boundaries in CVM assessment by integrating data-efficient FSL with label distribution learning methods.

\begin{figure}[H]
    \centering
    \includegraphics[width=0.5\textwidth]{plots/gradcam_comparison_updated2.pdf} 
    \caption[Grad-CAM visualizations showing model attention]{
        Grad-CAM visualizations showing model attention: The first three rows demonstrate incorrect predictions with diffused and misplaced attention, while the subsequent two rows show correct predictions with focused attention on relevant vertebral regions, where red indicates regions of highest importance for the model's decision.
    }
    \label{fig:grad_cam}
\end{figure}

\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{This research has been funded by the Federal Ministry of Education and Research of
Germany and the state of North-Rhine Westphalia as part of the Lamarr-Institute
for Machine Learning and Artificial Intelligence, LAMARR22B.}


\bibliography{midl25_171}


\appendix

\section{Detailed Result Table}

\begin{sidewaystable}
\centering
\caption{Class-wise performance of different models using various loss functions (CE, BCE, SCL) across six classes (CS1 to CS6).}
\resizebox{\textwidth}{!}{%
\begin{tabular}{c|c|c|c|c|c|c|c}
\toprule
\textbf{Model} & \textbf{Loss} & \textbf{CS1} & \textbf{CS2} & \textbf{CS3} & \textbf{CS4} & \textbf{CS5} & \textbf{CS6} \\ \midrule
\multirow{3}{*}{ResNet18} & CE & $69.05 \pm 13.47$ & $42.22 \pm 11.33$ & $28.21 \pm 15.81$ & $66.67 \pm 6.24$ & $50.00 \pm 7.42$ & $57.89 \pm 21.49$ \\ 
 & BCE & $66.67 \pm 3.37$ & $57.78 \pm 3.14$ & $43.59 \pm 3.63$ & $61.67 \pm 2.36$ & $60.61 \pm 4.29$ & $71.93 \pm 4.96$ \\ 
 & SCL & $61.90 \pm 12.14$ & $57.78 \pm 8.31$ & $48.72 \pm 7.25$ & $66.67 \pm 2.36$ & $68.18 \pm 0.00$ & $64.91 \pm 13.13$ \\ \midrule
Med-ResNet18 & CE & $64.29 \pm 10.10$ & $35.56 \pm 11.33$ & $30.77 \pm 10.88$ & $65.00 \pm 7.07$ & $60.61 \pm 9.34$ & $56.14 \pm 4.96$ \\ \midrule
SAM-Med 2D & CE & $69.05 \pm 3.37$ & $26.67 \pm 10.89$ & $38.46 \pm 6.28$ & $55.00 \pm 4.08$ & $37.88 \pm 2.14$ & $64.91 \pm 4.96$ \\ \bottomrule
\end{tabular}
}
\label{tab:model_performance}
\end{sidewaystable}

\afterpage{\clearpage}

\begin{sidewaystable}[p]
    \centering
    \caption{Performance comparison of different models and FSL methods for k-shot values $k \in \{1,3,5\}$.}
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{l|c|c|c|c|c|c|c}
        \toprule
        \multirow{2}{*}{Model} & \multirow{2}{*}{$k$} & \multicolumn{2}{c|}{Acc. (\%)} & \multicolumn{2}{c|}{Relaxed Acc. (\%)} & \multicolumn{2}{c}{MAE} \\ \cmidrule{3-8}
        & & BCE & SCL & BCE & SCL & BCE & SCL \\ \midrule
        ResNet18 (Scratch) & - & 46.93 $\pm$ 0.46 & - & 83.88 $\pm$ 1.27 & - & 0.80 $\pm$ 0.06 & - \\ 
        \multirow{3}{*}{FSL w/ ResNet18 (Scratch)} 
        & 1 & 41.75 $\pm$ 0.00 & - & 87.38 $\pm$ 0.00 & - & 0.71 $\pm$ 0.00 & - \\ 
        & 3 & 47.57 $\pm$ 0.00 & - & 86.41 $\pm$ 0.00 & - & 0.71 $\pm$ 0.00 & - \\ 
        & 5 & 51.46 $\pm$ 0.00 & - & 90.29 $\pm$ 0.00 & - & 0.64 $\pm$ 0.00 & - \\ \midrule
        ResNet18 & - & 53.40 $\pm$ 1.59 & - & 88.03 $\pm$ 0.51 & - & 0.59 $\pm$ 0.02 & - \\ 
        \multirow{3}{*}{FSL w/ ResNet18} 
        & 1 & 55.66 $\pm$ 3.00 & 61.49 $\pm$ 3.00 & 92.56 $\pm$ 1.21 & 93.53 $\pm$ 1.65 & 0.53 $\pm$ 0.04 & 0.47 $\pm$ 0.02 \\ 
        & 3 & 62.14 $\pm$ 2.10 & 55.66 $\pm$ 0.92 & 94.17 $\pm$ 2.38 & 93.85 $\pm$ 1.65 & 0.46 $\pm$ 0.06 & 0.51 $\pm$ 0.03 \\ 
        & 5 & 61.17 $\pm$ 2.10 & 62.46 $\pm$ 4.37 & 92.88 $\pm$ 1.21 & 91.59 $\pm$ 0.46 & 0.47 $\pm$ 0.01 & 0.49 $\pm$ 0.03 \\ \midrule
        Med-ResNet18 & - & 57.93 $\pm$ 0.92 & - & 89.32 $\pm$ 1.59 & - & 0.54 $\pm$ 0.03 & - \\ 
        \multirow{3}{*}{FSL w/ Med-ResNet18} 
        & 1 & 49.84 $\pm$ 3.99 & 55.02 $\pm$ 1.21 & 89.32 $\pm$ 3.63 & 90.61 $\pm$ 0.92 & 0.62 $\pm$ 0.05 & 0.56 $\pm$ 0.02 \\ 
        & 3 & 56.31 $\pm$ 2.10 & 56.96 $\pm$ 3.57 & 89.00 $\pm$ 0.92 & 91.26 $\pm$ 0.79 & 0.57 $\pm$ 0.02 & 0.54 $\pm$ 0.05 \\ 
        & 5 & 55.99 $\pm$ 0.92 & 56.31 $\pm$ 1.59 & 92.23 $\pm$ 1.59 & 90.94 $\pm$ 2.29 & 0.53 $\pm$ 0.03 & 0.55 $\pm$ 0.03 \\ \midrule
        SAM-Med 2D & - & 48.87 $\pm$ 1.65 & - & 87.06 $\pm$ 0.92 & - & 0.67 $\pm$ 0.02 & - \\ 
        \multirow{3}{*}{FSL w/ SAM-Med 2D} 
        & 1 & 44.34 $\pm$ 1.21 & 49.19 $\pm$ 5.16 & 86.41 $\pm$ 2.86 & 86.41 $\pm$ 0.79 & 0.71 $\pm$ 0.02 & 0.66 $\pm$ 0.06 \\ 
        & 3 & 46.93 $\pm$ 0.92 & 47.90 $\pm$ 2.78 & 85.76 $\pm$ 2.29 & 86.41 $\pm$ 1.59 & 0.71 $\pm$ 0.03 & 0.67 $\pm$ 0.04 \\ 
        & 5 & 47.90 $\pm$ 2.78 & 45.95 $\pm$ 0.92 & 85.44 $\pm$ 0.79 & 85.11 $\pm$ 1.21 & 0.68 $\pm$ 0.03 & 0.71 $\pm$ 0.03 \\ \bottomrule
    \end{tabular}%
    }
    \label{table:appendix}
\end{sidewaystable}

\end{document}
