\section{Method}

Our pipeline consists of three stages, as shown in \figureref{fig:overview}. First, we pre-train the language-guided detection network \ac{G-DINO} on a large multi-modal, multi-organ dataset to detect 163 different regions of interest (104 for CT and 59 for MRI). Second, we fine-tune the pre-trained \ac{G-DINO} on the pathological datasets, which yields the baseline. Third, we fine-tune the pre-trained model on the pathological datasets using two different \ac{CL} strategies.

\subsection{Ground Truth Bounding Box Generation}\label{sec:box_gen}

Open-source medical object detection datasets are even scarcer than segmentation datasets. To leverage the relative abundance of medical segmentation datasets, we devised an efficient method to generate ground truth bounding boxes from existing segmentation data. To generate bounding boxes for the pathology datasets, we first removed all segmentation masks not related to pathologies, such as liver or liver vessel masks. We then merged the masks of tumors that are annotated in separate compartments (e.g., ``contrast enhancing'' and ``necrotic'' parts). Next, we dilated the binary segmentation mask in 2 iterations with a $3\times3$ kernel to remove discontinuities in the tumor mask. This yields a single bounding box per tumor rather than separate boxes for discontinuous regions of the same tumor, and it ensures that the boxes are not overly tight, which more closely resembles human annotation. Finally, we drew tight bounding boxes around the dilated segmentations (see \figureref{fig:dilation}). Models like MedSAM, which use oracle bounding boxes as training prompts, address noisy boxes by discarding segmentation masks below a size threshold. Our approach mitigates noise while retaining masks for small tumors.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.77\textwidth]{figures/dilation.png}
\caption{Depiction of the regularization effects of our bounding box pipeline using dilation.} \label{fig:dilation}
\end{figure}
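
The steps of \sectionref{sec:box_gen} can be sketched in a few lines of Python. This is a minimal illustration and not our released implementation: the function name \texttt{masks\_to\_boxes}, the use of SciPy, the slice-wise 2D processing, the 8-connectivity used to separate tumor instances, and the $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ box format are all assumptions made for this example.

```python
import numpy as np
from scipy import ndimage


def masks_to_boxes(mask, pathology_labels, iterations=2):
    """Return (x_min, y_min, x_max, y_max) boxes for one 2D label mask.

    `mask` holds integer labels; `pathology_labels` lists the labels that
    belong to pathologies (hypothetical label ids, dataset dependent).
    """
    # Drop non-pathology structures and merge tumor compartments
    # into a single binary mask.
    binary = np.isin(mask, list(pathology_labels))
    if not binary.any():
        return []
    # Dilate with a 3x3 kernel to bridge small discontinuities.
    kernel = np.ones((3, 3), dtype=bool)
    dilated = ndimage.binary_dilation(binary, structure=kernel,
                                      iterations=iterations)
    # Assumption: connected components (8-connectivity) define instances.
    components, _ = ndimage.label(dilated, structure=kernel)
    boxes = []
    for rows, cols in ndimage.find_objects(components):
        # Tight box around the dilated component, inclusive pixel indices.
        boxes.append((cols.start, rows.start, cols.stop - 1, rows.stop - 1))
    return boxes
```

In this sketch, two tumor fragments separated by a one-pixel gap end up in a single box, whereas a non-pathology label elsewhere in the slice produces no box.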

\subsection{Grounding DINO}

\ac{G-DINO} \cite{liu_grounding_2024} is an open-set object detector capable of identifying arbitrary objects based on textual input, such as referring expressions or category names. Given an $(\mathrm{Image}, \mathrm{Text})$ pair as input, \ac{G-DINO} predicts multiple $(\mathrm{Bounding \; Box}, \mathrm{Noun \; Phrase})$ pairs with a confidence score for each detected entity. The noun phrase is the predicted semantic entity of the box and is derived from the input prompt in an open-set fashion. The model employs a dual-encoder-single-decoder architecture consisting of an image encoder for visual feature extraction, a text encoder for textual information processing, a feature enhancer for fusion of the extracted features, a language-guided query selection module for query initialization, and a cross-modality decoder for bounding box refinement \cite{liu_grounding_2024}. We used the \ac{G-DINO} implementation from the mmdetection framework \cite{zhao_open_2024} and adopted the focal loss \cite{focal} ($\gamma = 2.0$ and $\alpha = 0.25$) and the weighted L1 loss ($w = 5$) as loss functions \cite{zhao_open_2024}. For the image and text encoders, we used Swin-Tiny \cite{liu_swin} and bert-base-uncased \cite{bert}, respectively. The text prompt is constructed by concatenating all possible class names. Thus, the fine-tuning prompt for all training images was ``glioma . brain\_metastasis . liver\_tumor . kidney\_tumor''.
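
For reference, the two loss terms named above can be written in their standard forms. The notation is introduced here for illustration only and is not taken from the implementation: $p_t$ denotes the predicted probability of the target class, $\alpha_t$ equals $\alpha$ for positive and $1 - \alpha$ for negative targets, and $b$ and $\hat{b}$ denote the ground truth and predicted box coordinates.
\begin{equation}
\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t), \qquad \gamma = 2.0, \; \alpha = 0.25,
\end{equation}
\begin{equation}
\mathcal{L}_{\mathrm{L1}}(b, \hat{b}) = w \, \| b - \hat{b} \|_1, \qquad w = 5.
\end{equation}
With $\gamma = 2.0$, a well-classified target with $p_t = 0.9$ is down-weighted by the factor $(1 - 0.9)^2 = 0.01$, so that hard, misclassified targets dominate the classification loss.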

\subsection{Curriculum Learning}\label{curriculum}

Weinshall et al. \cite{weinshall2018curriculum} showed theoretically that, in linear regression, the convergence rate decreases with increasing sample difficulty. Empirically, they found that in non-convex optimization, higher difficulty increases gradient variance, which slows convergence and worsens generalization compared to \ac{CL}. Based on this, we propose two difficulty-sorting methods: teacher \ac{CL} and bounding box \ac{CL}. For the teacher \ac{CL}, we used the baseline model to perform inference on the entire training dataset of pathological images and computed the \ac{AP} score for each image. We then sorted the images into five difficulty levels based on these scores (ranging from 1 for the easiest to 5 for the most difficult). The baseline model thus acts as a difficulty grader, under the assumption that high-precision samples are easier to learn and low-precision samples are more challenging. For false positives, we manually set the \ac{AP} score to $0.0$ if the network predicts a bounding box with a confidence level greater than $0.3$. In the bounding box \ac{CL}, difficulty is defined by bounding box size, as multiple studies (including the original \ac{G-DINO} work) show that precision scores increase with larger boxes. The bounding box curriculum assigns each training sample to one of five difficulty levels based on the size of the smallest bounding box present in the image. We found that bounding box size is a primary indicator of prediction difficulty, as shown by the performance of the baseline model (\tableref{tab:results}). This approach has the advantage of not requiring a trained baseline model a priori. In both \ac{CL} approaches, we randomly assigned training samples without bounding boxes (i.e., no object of interest) to the difficulty categories to maintain an equal distribution. To fine-tune the model on the pathological dataset using \ac{CL}, we start with the easiest category and progressively introduce more difficult categories at each epoch. After 5 epochs, the network is fine-tuned on the complete data until convergence.
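
The bounding box curriculum and its epoch schedule can be sketched as follows. This is a simplified illustration under several assumptions that the text above does not fix: the five levels are formed as equally sized rank bins over the smallest box area, samples without boxes are drawn uniformly at random across levels, and the schedule is cumulative (epoch $e$ trains on levels $1$ to $e$). The function names \texttt{assign\_levels} and \texttt{samples\_for\_epoch} are hypothetical.

```python
import random


def assign_levels(min_box_areas, num_levels=5, seed=0):
    """Map each sample to a difficulty level in 1..num_levels.

    min_box_areas[i] is the area of the smallest box in sample i,
    or None if the sample contains no box.
    """
    rng = random.Random(seed)
    with_box = [i for i, a in enumerate(min_box_areas) if a is not None]
    # Larger smallest box means easier, so rank from largest to smallest.
    with_box.sort(key=lambda i: min_box_areas[i], reverse=True)
    levels = {}
    for rank, i in enumerate(with_box):
        # Assumption: equally sized rank bins; thresholds are not given.
        levels[i] = 1 + rank * num_levels // len(with_box)
    for i, area in enumerate(min_box_areas):
        if area is None:
            # Samples without objects are spread randomly over all levels.
            levels[i] = rng.randint(1, num_levels)
    return [levels[i] for i in range(len(min_box_areas))]


def samples_for_epoch(levels, epoch, num_levels=5):
    """Indices used at a 1-based epoch: levels 1..epoch, then all data."""
    max_level = min(epoch, num_levels)
    return [i for i, lvl in enumerate(levels) if lvl <= max_level]
```

Under the same assumptions, the teacher curriculum could reuse \texttt{assign\_levels} by passing per-image \ac{AP} scores in place of box areas, since in both cases a higher value indicates an easier sample.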