\section{Experiments and Results}\label{exp}
\subsection{Pretraining \& Datasets}

In our experiments, we initialized the \ac{G-DINO} architecture with weights published by \cite{zhao_open_2024}, which were trained on several natural image datasets (Objects365 \cite{o365}, GRIT \cite{peng2023kosmos}, V3Det \cite{wang2023v3det}, and the Golden-G dataset \cite{kamath2021mdetr}). We first pretrained this network on the TotalSegmentator dataset, which consists of \ac{CT} \cite{wasserthal_totalsegmentator_2023} and \ac{MRI} \cite{d2024totalsegmentator} scans of the entire human body, to adapt the image encoder weights to medical imaging modalities and to improve medical semantic understanding. We then fine-tuned the network on heterogeneous datasets spanning multiple modalities, pathologies, hospitals, and scanner manufacturers. This aggregated dataset includes both \ac{MRI} and \ac{CT} scans with four detection targets: brain metastases, gliomas, liver tumors, and kidney tumors. A detailed overview of all datasets used in this work is given in \tableref{tab:dataset_summary} in the Appendix.

\subsection{Data Preprocessing \& Training Details}

Both \ac{CT} and \ac{MRI} datasets, originally in 3D NIfTI format, require preprocessing for compatibility with \ac{G-DINO}, a 2D object detector. Following Ma et al. \cite{ma_segment_2024}, we clipped \ac{MRI} images to their $[0.05,99.5]$ percentile range and normalized them to $[0,255]$, while \ac{CT} images were windowed (level = 40 and width = 400) before normalization. For the pathological datasets, we used organ segmentation masks to retain only slices containing the organ associated with the pathology. Patients from all datasets were split into train/validation/test sets ($0.7,0.15,0.15$). The resulting 2D training dataset consisted of 199,672 slices, of which 66,990 had bounding box annotations (i.e., tumors were present). During training, the BraTS\_Glioma dataset was undersampled by a factor of 3 to ensure a balanced class distribution. We trained the models on two NVIDIA RTX A6000 GPUs with a batch size of 10 until convergence. The baseline model was trained with a learning rate of $10^{-5}$ for the first 5 epochs, followed by a learning rate of $10^{-6}$, while the curriculum models were trained with $10^{-5}$ for the first 7 epochs and $10^{-6}$ for the remaining epochs to ensure equal data exposure and to compensate for the shorter curriculum epochs caused by data exclusion. \figureref{fig:train_curve} in the Appendix illustrates the training loss and validation curves.
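
As a worked illustration of the intensity mapping used in preprocessing (a sketch of the stated steps, not necessarily the exact implementation), let $x$ denote a voxel intensity and $[a,b]$ the clipping interval; the symbols $x$, $a$, $b$, and $\tilde{x}$ are introduced here for illustration only. Clipping followed by rescaling to $[0,255]$ can be written as
\[
\tilde{x} = 255 \cdot \frac{\min(\max(x,a),\,b) - a}{b - a}.
\]
For \ac{CT}, a window with level $40$ and width $400$ corresponds to $a = 40 - 400/2 = -160$ and $b = 40 + 400/2 = 240$, assuming the intensities are given in Hounsfield units. For \ac{MRI}, $a$ and $b$ are the lower and upper percentile values, here assumed to be computed per volume.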

\subsection{Curriculum Learning: Difficulty Categories}

We categorized the training data into five difficulty levels based on two heuristics to obtain two \ac{CL} strategies, as explained in \sectionref{curriculum}. After training the baseline model, we observed that bounding box size correlated with the precision score and could therefore serve as an indicator of sample difficulty (\tableref{tab:results}). To standardize bounding box sizes, we calculated their areas relative to the image area. We then sorted the samples by the smallest bounding box present in each slice and split them so that the categories were evenly populated. Slices without bounding boxes were randomly assigned to categories to maintain an equal distribution of annotated and unannotated samples. For teacher \ac{CL}, where difficulty is derived from the baseline model's performance on each training sample, we additionally included slices without ground-truth bounding boxes but with false positive predictions from the baseline model and assigned them an \ac{AP} score of 0.0. We then defined \ac{AP} intervals that yield a roughly equal distribution across categories. Dataset distributions across categories are shown in the Appendix for both heuristics in \tableref{tab:areas} and \tableref{tab:areas_teacher}, respectively.
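
The bounding box heuristic can be sketched as follows; the notation is introduced here for illustration and is not taken from the original implementation. For a slice of width $W$ and height $H$ containing $n$ boxes with widths $w_i$ and heights $h_i$, the relative area of the $i$-th box and the resulting slice-level score are
\[
a_i = \frac{w_i h_i}{W H}, \qquad s = \min_{1 \le i \le n} a_i .
\]
Slices are then ranked by $s$ and split into five categories of roughly equal size, with smaller $s$ treated as more difficult. For example, a $32 \times 32$ pixel box in a $512 \times 512$ slice (hypothetical sizes) has $a_i = 1024/262144 \approx 0.0039$, i.e., about $0.4\%$ of the image area.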

\input{tables/results_table}

\subsection{Experimental Setup}

After pretraining on the TotalSegmentator dataset, we fine-tuned \ac{G-DINO} jointly on the full pathological training dataset to obtain three models: a baseline model and two \ac{CL}-based models using the bounding box and teacher principles introduced in \sectionref{curriculum}. For the baseline, the training data was sampled entirely at random. All weights were updated during training, including those of the image and text encoders. During training, model weights were evaluated on the merged validation sets, and the best-performing weights (by mean \ac{AP} over all detection targets) were selected. Subsequent testing was performed on each dataset individually. We evaluated the object detection results using the COCO metrics \cite{lin2014microsoft}. We used \ac{AP}\footnote{Standard deviation cannot be computed for single-class AP scores from a single model, as AP is a single summary value (the area under the precision-recall curve) rather than a distribution.} values at different IoU thresholds: \ac{AP}@0.5 and \ac{AP}@0.75 with IoU thresholds of 0.5 and 0.75, respectively. \ac{AP} without a specified threshold denotes the average across IoU thresholds between 0.5 and 0.95 in 0.05 increments. Additionally, we evaluated the predictions separately for different bounding box sizes, where \ac{AP} small, \ac{AP} medium, and \ac{AP} large refer to the \ac{AP} for ground-truth bounding boxes with areas of $[0,32^2)$, $[32^2,96^2)$, and $[96^2,\infty)$ pixels, respectively.
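
For reference, the COCO-style quantities used here can be written compactly. These are the standard definitions rather than anything specific to this work, and the symbols $B_p$, $B_{gt}$, and $T$ are introduced for illustration. The IoU between a predicted box $B_p$ and a ground-truth box $B_{gt}$ is
\[
\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|},
\]
and the \ac{AP} without a specified threshold averages over the ten thresholds $T = \{0.50, 0.55, \dots, 0.95\}$:
\[
\mathrm{AP} = \frac{1}{10} \sum_{t \in T} \mathrm{AP}@t .
\]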

\subsection{Results}

Pretraining took around $4$ days, while fine-tuning took around $2.5$ days for each model. Teacher \ac{CL} additionally required 2.5 days of baseline training beforehand and approximately 5 hours for evaluating the baseline on the entire training set to assign a difficulty class to each sample. \tableref{tab:results} shows the quantitative test results for all models. \figureref{fig:results} shows the predictions of our three models alongside the ground truth for one case from each dataset. Both \ac{CL} approaches improved performance on the average \ac{AP} metrics, with the bounding box \ac{CL} model achieving the highest gains (+5.2\% \ac{AP}, +6.1\% \ac{AP}@0.75, +6.0\% \ac{AP}@0.5 over the baseline). Both \ac{CL} models outperformed the baseline in all size-constrained \ac{AP} scores, with the largest gains in the most difficult categories (\ac{AP} small: +4.7\% for both models, \ac{AP} medium: +3.5\% and +3.8\% for bounding box \ac{CL} and teacher \ac{CL}, respectively). The results thus support our hypothesis that \ac{CL} improves performance especially for the most difficult samples with the smallest tumors. Looking at individual datasets, the \ac{CL} models performed best on 5 out of 6 datasets for all metrics; only on the MSD\_Liver dataset did they slightly underperform the baseline ($-0.5\%$ \ac{AP} for bounding box \ac{CL}, $-1.3\%$ \ac{AP} for teacher \ac{CL}). Overall, the results indicate that \ac{CL} generally improves model performance, especially for challenging detection tasks with small to medium bounding boxes.
\noindent \figureref{fig:areas} in the Appendix shows the density distribution across categories as the teacher model trains. After training, the distribution shifts toward the ``Easiest'' category.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.85\textwidth]{figures/results_visualization.png}
\caption{Visualization of the predicted results with an illustrative case from each dataset.} \label{fig:results}
\end{figure}

\subsection{Ablation Studies}

\textbf{Anti-Curriculum}: To show the effect of \ac{CL}, and to follow recent work \cite{wu2023semi, apan, braun2017curriculum} that proposes a hard-to-easy methodology (anti-\ac{CL}), we also trained our models in such a setting. The results are shown in the Appendix in \tableref{tab:results_ablation}. Anti-box \ac{CL} achieves an overall \ac{AP} score of 49.5\%, which is 3.0\% better than the baseline and 2.2\% lower than regular bounding box \ac{CL}. Anti-teacher \ac{CL}, on the other hand, achieves an overall \ac{AP} score of 51.2\%, which is 3.7\% better than the baseline and only slightly (0.2\%) worse than regular teacher \ac{CL}. Both \ac{CL} and anti-\ac{CL} proved effective for both difficulty sorting approaches and outperformed the baseline. Our results show that when training samples were sorted by difficulty based on the performance of the baseline model, the sorting order had little effect on final accuracy. However, when the difficulty sorting was based on bounding box size, the sample order during training had a more pronounced effect, leading to greater variation in accuracy. This suggests that difficulty sorting based on manual heuristics interacts more strongly with the learning dynamics than teacher-based difficulty sorting.

\noindent\textbf{Fine-tuning Modality:} In this experiment, we aim to determine the individual contribution of each modality and the overall benefit of combining them during fine-tuning. \tableref{tab:results_modality} in the \textcolor{blue}{Appendix} \ref{app:one_mod_finetune} presents test scores for two models, each fine-tuned on a single modality without \ac{CL}. The results do not indicate a clear advantage of fine-tuning on a single modality over fine-tuning on multiple modalities. As expected, the \ac{CT} model performs poorly on \ac{MRI} datasets and vice versa. When tested on its own modality, each single-modality model performs comparably to the multi-modal networks (\tableref{tab:results}). Specifically, the \ac{CT} model underperforms the best model on all three \ac{CT} datasets, while the \ac{MRI} model achieves performance similar to the multi-modal fine-tuned \ac{CL} models.

\noindent\textbf{Pretraining:} To evaluate the effect of pretraining, we performed multiple ablations. First, we tested the natural-image-pretrained and the medically pretrained \ac{G-DINO} (both without fine-tuning) on the pathological datasets to compare their comprehension of pathologies. While both models score $<1\%$ \ac{AP} across all datasets, a qualitative analysis (see \figureref{fig:pretraining_qualitative} in the \textcolor{blue}{Appendix} \ref{app:test_wo_finetune}) suggests that the medically pretrained model grasps the concept of a tissue structure better, whereas the natural-image-pretrained model tends to detect the entire anatomical structure against the background. We also fine-tuned two additional bounding box \ac{CL} models: one pretrained only on \ac{MRI} scans from TotalSegmentator, and the other only on \ac{CT} scans. The results in \tableref{tab:results_pretraining} in the \textcolor{blue}{Appendix} \ref{app:one_mod_pretrain} indicate that multi-modal pretraining yields better results (51.7\% \ac{AP}, \tableref{tab:results}) than \ac{MRI}-only (50.6\% \ac{AP}) and \ac{CT}-only (50.9\% \ac{AP}) pretraining.

\noindent\textbf{CL Categories:} To evaluate the effect of the number of difficulty categories used during \ac{CL} training, we trained the bounding box \ac{CL} model with only two difficulty categories, as opposed to the five used otherwise. The results (\tableref{tab:results_diff_cat} in \textcolor{blue}{Appendix} \ref{app:two_cat_CL}) indicate that fewer difficulty categories do not increase overall performance: the two-category model reaches an \ac{AP} score of 50.4\%, compared to 51.7\% for bounding box \ac{CL} in \tableref{tab:results}.
