\section{Introduction and Related Work}\label{section:intro}
Pathology detection in medical imaging is a cornerstone of modern radiology and plays a crucial role in diagnosis and treatment planning. Accurate detection determines the presence, location, and extent of abnormalities, which in turn supports diagnosis, enables targeted treatment, and allows clinicians to monitor disease progression and evaluate therapy efficacy. Simplifying the associated workflows and reducing prediction complexity are essential to increase efficiency, remove barriers to clinical adoption, and improve prediction quality. This can be achieved by unifying specialized models into foundation models and by reducing algorithmic and task complexity. In this context, segmentation algorithms can be replaced by detection-only algorithms for many clinical tasks, such as identifying new metastases or counting existing lesions.

Recent advances in medical imaging have led to the development of foundation models capable of handling multiple modalities and interactive tasks \cite{ma_segment_2024, ma_segment_2024-1}. These models are more flexible than specialized segmentation networks such as nnU-Net \cite{isensee_nnu-net_2021}: they operate on different imaging modalities and accept visual input prompts. Recent work has also demonstrated the benefits of language guidance in medical image analysis, with several studies focusing on language-driven segmentation \cite{koleilat_medclip-sam_2024, li_lvit_2023, liu_clip-driven_2023, zhao_biomedparse_2024} and on improving zero-shot and few-shot performance \cite{koleilat2024medclip}. Notable work has been done in medical object detection, particularly for brain tumors \cite{mercaldo_object_2023, chen_enhancing_2024, abdusalomov2023brain, he2023cancer}. However, there appears to be a lack of multi-modal and multi-pathology detection frameworks. The success of \ac{G-DINO} \cite{liu_grounding_2024}, an open-set language-guided object detector, has generated interest in its application to medical imaging, where it allows different pathologies, modalities, and text prompts to be integrated into a single network. So far, only a few algorithms \cite{biswas_polyp-sam_2023, xie_simtxtseg_2024, ramesh_lung_2023} have taken advantage of this additional guidance for object detection in medical imaging. Moreover, these works do not investigate \ac{G-DINO} in detail as a stand-alone architecture, but rather use it as a box prompt generator for SAM, following the idea of Grounded SAM \cite{ren2024groundedsamassemblingopenworld}. In this study, we focus on developing a language-guided network to detect pathologies, specifically tumors, in various organs.

Unlike object detection in natural images, tumor detection poses unique challenges due to the significant variability in tumor phenotypes across patients, which demands large datasets to achieve generalization. However, the scarcity of annotated medical imaging data makes generalization challenging, especially for detecting smaller tumors \cite{abdusalomov2023brain, he2023cancer}. Foundation segmentation models such as MedSAM \cite{ma_segment_2024} sidestep this issue by entirely excluding pathologies with a volume less than $1000$ pixels and a cross-sectional area less than $100$ pixels. To address these shortcomings, we investigate different \ac{CL} strategies to increase detection accuracy. \ac{CL}, introduced by Bengio et al. \cite{bengio_curriculum_2009}, has found several applications in the medical domain \cite{jimenez2019medical, wei2021learn, oksuz2019automatic, fischer_progressive_2024} with the goal of improving performance by gradually increasing training complexity. The strategy used in this study, called data-level \ac{CL}, gradually increases the complexity of the training samples. First, the model is trained on large, well-contrasted tumors to establish robust feature representations. Next, the training data is expanded to include smaller, less conspicuous tumors with increasing anatomical and modality variability. This progressive learning approach is intended to help the model generalize better and to improve its sensitivity to subtle pathological findings.
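
As an illustrative sketch (not a definitive specification), data-level \ac{CL} can be written as a sequence of nested training sets. The symbols below are introduced here for illustration only: $\mathcal{D}$ is the full training set, $d(x)$ is a hypothetical scalar difficulty score for a sample $x$ (for example, decreasing with tumor size), and $\tau_1 < \tau_2 < \dots < \tau_K$ are assumed stage thresholds:
\[
\mathcal{D}_k = \{\, x \in \mathcal{D} \;:\; d(x) \le \tau_k \,\}, \qquad k = 1, \dots, K, \qquad \mathcal{D}_1 \subseteq \mathcal{D}_2 \subseteq \dots \subseteq \mathcal{D}_K = \mathcal{D}.
\]
Under this sketch, the model is trained on $\mathcal{D}_k$ during stage $k$, so that easy samples are seen first and the final stage covers the entire dataset, assuming $\tau_K \ge \max_{x \in \mathcal{D}} d(x)$.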

In this work, we explore the effect of two \ac{CL} strategies on \ac{G-DINO}'s detection performance by pre-training the network on the TotalSegmentator dataset \cite{wasserthal_totalsegmentator_2023}, followed by \ac{CL}-based fine-tuning on tumor datasets spanning different imaging modalities (\ac{MRI} and \ac{CT}) and anatomical sites (brain metastasis, glioma, and liver and kidney tumors). In addition, we developed a pipeline that converts ground truth segmentations into bounding boxes, using morphological operations to consolidate them. This goes beyond the naive approach of drawing tight bounding boxes around segmentations. We conducted an extensive evaluation of the \ac{G-DINO} baseline model, comparing its performance with models trained using two \ac{CL} approaches: teacher \ac{CL} \cite{weinshall2018curriculum} and bounding box \ac{CL} \cite{shi2016weakly}. The results show a 4.9\% improvement in \ac{AP} with teacher \ac{CL} and a 5.2\% improvement in \ac{AP} with bounding box \ac{CL} compared to the baseline model. To the best of our knowledge, based on the reviewed literature, this paper is among the first efforts to:

\begin{enumerate}[nosep]
    \item Apply two different \ac{CL} strategies to a language-guided detection network (\ac{G-DINO}).
    \item Train \ac{G-DINO} jointly on different pathologies from various body regions and modalities, demonstrating the model's versatility with limited datasets.
    \item Develop and formalize a novel preprocessing pipeline to convert medical segmentation datasets into object detection datasets.
\end{enumerate}
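
To illustrate the idea behind the segmentation-to-detection conversion, the following is a minimal sketch under stated assumptions rather than the exact pipeline. We assume a binary ground truth mask $M \subset \mathbb{Z}^2$ and a hypothetical connected structuring element $S$ containing the origin, and we take dilation as one example of a consolidating morphological operation. Nearby fragments are first merged, and one box is then drawn per connected component $C_1, \dots, C_n$ of the dilated mask:
\[
M' = M \oplus S, \qquad
b_i = \Bigl( \min_{p \in C_i \cap M} p_x,\; \min_{p \in C_i \cap M} p_y,\; \max_{p \in C_i \cap M} p_x,\; \max_{p \in C_i \cap M} p_y \Bigr), \qquad i = 1, \dots, n.
\]
Restricting the extrema to $C_i \cap M$ keeps each box tight around the original foreground, while the dilation only decides which fragments share a box. With $S = \{(0,0)\}$ the sketch reduces to the naive approach of one tight box per connected component of $M$.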

\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{figures/overview.png}
\caption{Overview of our method: As a first step, the natural-image \ac{G-DINO} model is pre-trained on the TotalSegmentator dataset (top left). In a second step, the baseline is fine-tuned without \ac{CL} on all pathology datasets (top right). Finally, two \ac{CL} models are trained: for teacher \ac{CL}, the baseline is used to guide the difficulty sorting, while for bounding box \ac{CL}, the size of the bounding boxes is used for difficulty sorting (bottom).} \label{fig:overview}
\end{figure}