\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images

% Our own packages
\usepackage{booktabs}
\usepackage{multirow}
% \usepackage{subcaption} % midl not allow
% \usepackage{subfigure}
\usepackage{amsmath}  % Required for advanced math formatting
\usepackage{amssymb}  % Provides \mathbb{} (blackboard bold symbols)

\jmlrvolume{-- nnn}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\title[CDSS-Organ Detection]{Cross-Domain Semi-Supervised  Organ Detection}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{
  \Name{Nian Li\nametag{$^{1}$}} \Email{nian.li@tum.de}\\
  \Name{Morteza Ghahremani\nametag{$^{1,2}$}} \Email{morteza.ghahremani@tum.de}\\
  \Name{Bailiang Jian\nametag{$^{1,2}$}} \Email{bailiang.jian@tum.de}\\
  \Name{Pascual Tejero Cervera\nametag{$^{1}$}} \Email{pascual.tejero@tum.de}\\
  \Name{Benedikt Wiestler\nametag{$^{1,2}$}} \Email{b.wiestler@tum.de}\\
  \Name{Marcus Makowski\nametag{$^{1}$}} \Email{marcus.makowski@tum.de}\\
  \Name{Christian Wachinger\nametag{$^{1,2}$}} \Email{christian.wachinger@tum.de}\\
  \addr $^{1}$Technical University of Munich (TUM), \quad $^{2}$Munich Center for Machine Learning (MCML)
}

\begin{document}

\maketitle

\begin{abstract}
Domain adaptation for 3D organ detection in CT imaging is challenging due to variations in scanner types, imaging protocols, and overall acquisition conditions. 
As supervised detection models require large, annotated datasets from diverse scanners and institutions, semi-supervised approaches have gained attention for their ability to leverage unlabeled target data alongside limited annotations. However, traditional semi-supervised methods typically fail to make effective use of the few labeled target samples and often yield unsatisfactory results. To address this limitation, we introduce CDSS-Det, a novel cross-domain semi-supervised framework for 3D organ detection built upon the Transformer-based Organ-DETR model, which addresses unreliable pseudo-labels and limited target supervision under domain shift. It introduces a curriculum-guided pseudo-labeling mechanism and domain-robust representation learning to enable effective knowledge transfer from a well-annotated source domain to a sparsely labeled target domain. Experiments on multi-domain CT datasets demonstrate that incorporating a small number of labeled target samples significantly boosts detection performance over conventional domain adaptation and semi-supervised methods. CDSS-Det consistently achieves higher mean Average Precision (mAP), with notable improvements in detecting small organs, and surpasses a fully supervised model trained solely on the labeled target domain by over 10 mAP points. These results underscore the potential of CDSS-Det in efficiently leveraging both labeled and unlabeled target data in cross-domain organ detection, advancing annotation-efficient deep learning models in medical imaging.
\end{abstract}

\begin{keywords}
Organ Detection, Domain Transfer, Domain Adaptation
\end{keywords}

\section{Introduction}

Accurate 3D organ detection from CT scans is essential for disease diagnosis, surgical planning, and downstream applications such as segmentation~\cite{Ma2021AbdomenCT1K}. 
Although deep learning-based object detection models have achieved impressive results on well-annotated datasets~\cite{organdetr}, their generalization to new domains is hindered by substantial domain shifts, arising from variations in scanner types, imaging protocols, and patient demographics. 
As illustrated in Figure~\ref{fig:domain_gap}, this shift is particularly pronounced in medical imaging, where transferring a model between datasets yields a drastic drop in mean Average Precision (mAP). In contrast, object detection generalizes well on natural image datasets~\cite{adaptiveteacher}. % whereas a transfer works well on computer vision data.



\begin{figure*}[t]
    \vspace{-2em}
    \centering
    \begin{minipage}[t]{0.48\textwidth}
        \centering
        \includegraphics[width=\textwidth]{fig/domain_gap.pdf}
        \vspace{2pt}
        {\small (a)}
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.48\textwidth}
        \centering
        \includegraphics[width=\textwidth]{fig/method_improvement.pdf}
        \vspace{2pt}
        {\small (b)}
    \end{minipage}
    \caption{(a) mAP obtained when a source-trained model is applied directly to a target domain, highlighting the severe domain gap in medical imaging compared to natural image datasets. (b) CDSS-Det improves cross-domain organ detection on the WORD dataset, outperforming both baseline and fully supervised models.}
    \label{fig:comparison_figures}
    \label{fig:domain_gap}
    \label{fig:method_improvement}
    \vspace{-3em}
\end{figure*}


Furthermore, developing high-performance object detectors requires large-scale labeled datasets, the annotation of which is both resource-intensive and time-consuming. To overcome this challenge, \emph{domain adaptation} has emerged as a promising approach, enabling models trained on a labeled source domain to generalize to a related but distinct target domain~\cite{guan2021domain}. Despite this progress, most prior work focuses on unsupervised domain adaptation (UDA)~\cite{SSDA1,SSDA2}. Although UDA methods require no labeled target data, they often struggle in medical imaging due to substantial domain shifts caused by variations in image acquisition and patient demographics~\cite{zhang2020generalizing}. 
\emph{Semi-supervised learning} (SSL) has been explored to address the scarcity of labeled data by leveraging unlabeled target data alongside limited labeled samples~\cite{SSL1,SSL2}. However, existing semi-supervised object detection methods in medical imaging often rely solely on labeled source data or unlabeled target data, making it challenging to achieve satisfactory performance due to domain shifts and the lack of direct supervision on the target domain~\cite{SSOD1,SSOD2}. Few-shot learning approaches attempt to mitigate this issue by training on a small set of labeled target samples~\cite{FSL1}, but they fail to utilize the large pool of available unlabeled target data, limiting their ability to generalize effectively. 


Recent studies have explored domain adaptation and semi-supervised learning for medical image analysis, primarily in classification and segmentation tasks. For instance, Yuan et al.~\cite{yuan2024domain} utilized pseudo-labeling for COVID-19 detection, demonstrating that incorporating unlabeled target data can enhance domain transfer in medical classification. In the segmentation domain, Cai et al.~\cite{cai2024class} proposed a Class-Aware Mutual Mixup strategy with triple alignments, while Basak and Yin~\cite{basak2023ssda} introduced a consistency-regularized disentangled contrastive learning approach, both showing the effectiveness of combining alignment and consistency in pixel-level tasks. Similarly, contrastive learning methods such as PixPro~\cite{pixpro} aim to improve feature consistency across domains in self-supervised settings. While these approaches have shown promising results, they are primarily designed for image-level or pixel-wise prediction and do not directly address the challenges of 3D object detection, which requires accurate instance-level localization and handling of class imbalance under domain shift. Moreover, existing methods are not designed for a practically important setting in medical imaging, where a small number of labeled target scans is available alongside abundant unlabeled target data. This mismatch limits their applicability to real-world cross-domain 3D detection scenarios, where effectively leveraging both limited labeled and abundant unlabeled target data is critical.

To address this gap, we propose cross-domain semi-supervised organ detection (CDSS-Det), a framework specifically designed for this practical setting. CDSS-Det builds upon the Transformer-based Organ-DETR model~\cite{organdetr} and introduces a curriculum-controlled pseudo-labeling mechanism tailored for cross-domain 3D medical detection. By explicitly exploiting both limited labeled target data and abundant unlabeled target data, the proposed framework enables stable and effective adaptation under large domain shifts. Experimental results demonstrate that CDSS-Det achieves superior mean Average Precision compared to existing baselines (Figure~\ref{fig:method_improvement}), highlighting the effectiveness of the proposed learning strategy for cross-domain organ detection. Our contributions are summarized below and the source code is publicly available at: \url{https://github.com/ai-med/CDSS-Det}.


\begin{itemize}
    \item We define a practical cross-domain semi-supervised setting for 3D organ detection, where labeled source data, limited labeled target data, and abundant unlabeled target data are jointly utilized. This setting better reflects real-world medical imaging scenarios and is underexplored in existing literature.

    \item We propose a curriculum-guided pseudo-labeling mechanism that dynamically regulates the contribution of pseudo-label supervision based on model confidence, enabling stable and effective learning under domain shift.

    \item We develop CDSS-Det, a unified framework that incorporates reliability-aware pseudo-label learning and domain-robust representation learning for cross-domain 3D detection. Extensive experiments on two benchmarks demonstrate consistent improvements over strong baselines, with particularly significant gains on small organs.
\end{itemize}





\section{Methodology}


\textbf{Preliminaries}. Organ detection in 3D CT imaging involves localizing anatomical structures using axis-aligned bounding boxes and assigning class labels to detected organs~\cite{shin2016organ}. Detection performance is evaluated using mAP at different IoU thresholds, mean Average Recall (mAR), precision, and recall.
We build upon Organ-DETR~\cite{organdetr}, a Transformer-based 3D object detector designed for medical imaging. It introduces MultiScale Attention (MSA) for handling varying organ sizes and Dense Query Matching (DQM) to improve query-object associations, enhancing detection robustness in CT scans. %Given its strong performance, Organ-DETR serves as the foundation for our semi-supervised cross-domain adaptation framework. 
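
As a brief illustration of the IoU criterion underlying these metrics (a sketch with hypothetical boxes, not values from our experiments), consider two axis-aligned boxes \( A = [0,10]^3 \) and \( B = [5,15] \times [0,10] \times [0,10] \) in voxel coordinates:
\[
\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{5 \cdot 10 \cdot 10}{1000 + 1000 - 500} = \frac{500}{1500} \approx 0.33 .
\]
Under this assumption, a prediction \( B \) for ground truth \( A \) would count as a miss at both the 50\% and 75\% IoU thresholds, even though it covers half of the ground-truth volume.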

\noindent \textbf{Problem Definition}. We address \emph{cross-domain semi-supervised} organ detection, where a model is trained using a labeled source dataset \( D_s = \{(X_i^s, Y_i^s)\} \), a small set of labeled target samples \( D_t = \{(X_i^t, Y_i^t)\} \), and a larger set of unlabeled target scans \( U_t = \{X_j^t\} \). This setting reflects practical medical imaging scenarios, where limited annotations are available in the target domain despite substantial domain shifts from the source domain. The primary challenge is the domain gap between \( D_s \) and \( D_t \), which can significantly degrade detection performance when models trained on \( D_s \) are directly applied to \( D_t \), as illustrated in Figure~\ref{fig:domain_gap}. While semi-supervised learning provides a natural approach by leveraging unlabeled target data through pseudo-labeling and teacher–student learning~\cite{SSL1}, existing methods are not designed for this cross-domain setting with limited labeled target data, and often suffer from unreliable pseudo-labels and unstable adaptation under domain shift.

%\subsection{Preliminaries}

 
\vspace{-0.5em}

\subsection{CDSS-Det Framework for Cross-Domain Semi-Supervised Learning}

\begin{figure}[t]
    \vspace{-2em}
    \centering
    \includegraphics[width=0.99\textwidth]{fig/arch.pdf}
    \vspace{-1em}
    \caption{Overview of the CDSS-Det framework. Both student and teacher branches are based on Organ-DETR, which incorporates Multi-Scale Attention (MSA) and Dense Query Matching (DQM) to enhance 3D organ detection. The teacher model processes unlabeled target data to generate pseudo-labels, while the student model is trained using labeled source data, labeled target data, and unlabeled target data. Supervised loss is computed from labeled predictions, domain loss is obtained via a discriminator with a gradient reversal layer, and pseudo loss is calculated between student predictions and pseudo labels. The teacher is updated via Exponential Moving Average (EMA) of the student weights.}
    \label{fig:architecture}
    \vspace{-1em}
\end{figure}


Figure~\ref{fig:architecture} provides an overview of the CDSS-Det framework. We consider a cross-domain semi-supervised setting for 3D organ detection, where labeled source data, a small amount of labeled target data, and abundant unlabeled target data are jointly available. To effectively leverage these heterogeneous data sources, we adopt a teacher–student framework, in which the teacher model generates pseudo-labels on unlabeled target data, and the student model is trained using both labeled and pseudo-labeled supervision.

However, directly applying standard teacher–student learning in cross-domain medical imaging is challenging due to substantial domain shifts across datasets, which can lead to unreliable pseudo-labels and unstable training. At the same time, the limited availability of labeled target data restricts the model’s ability to adapt to target-specific characteristics, while naive adaptation may degrade the knowledge learned from the source domain.

To address these challenges, CDSS-Det is designed around three key principles: (1) reliability-aware pseudo-label learning to mitigate noise introduced by domain shift, (2) curriculum-guided supervision balancing to regulate the contribution of pseudo-labels based on model confidence, and (3) domain-robust representation learning to align feature distributions while preserving discriminative knowledge. These components are integrated within a unified teacher–student framework to enable stable and effective cross-domain adaptation for 3D organ detection.

\noindent\textbf{Reliability-Aware Pseudo-Label Learning}: 
To exploit unlabeled target data while mitigating noise introduced by domain shift, pseudo-labels are filtered and refined based on prediction confidence. A pseudo-label is retained only if its classification confidence exceeds a threshold \( \tau \):
\vspace{-1em}
\begin{equation}
\hat{y}_i^t = \arg\max_{y_i} p(y_i \mid X_i^t), \quad \text{if} \quad \max_{y_i} p(y_i \mid X_i^t) > \tau,
\end{equation}
where \( X_i^t \) is the \( i \)-th unlabeled target sample, \( y_i \) ranges over the candidate class labels, \( p(y_i \mid X_i^t) \) is the class probability predicted by the teacher, and \( \hat{y}_i^t \) is the assigned pseudo-label. To eliminate redundant detections, we further apply IoU-based Non-Maximum Suppression (NMS). 
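
For illustration, assume a threshold of \( \tau = 0.8 \) and two hypothetical teacher predictions whose maximum class probabilities are 0.91 and 0.62 (values chosen only for this example):
\[
0.91 > \tau \;\Rightarrow\; \text{pseudo-label retained}, \qquad 0.62 \le \tau \;\Rightarrow\; \text{prediction discarded}.
\]
The second prediction therefore contributes no supervision, which trades coverage of the unlabeled data for label reliability.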

Due to domain shift, bounding box predictions from pseudo-labels may remain unreliable. Therefore, instead of applying regression losses, we use classification-only supervision for pseudo-labels. The pseudo-label loss is defined as:
\vspace{-1em}
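\begin{equation}
\mathcal{L}_{\text{pseudo}} = \frac{1}{N_p} \sum_{i=1}^{N_p} \mathcal{L}_{\text{CE}}(y_i^t, \hat{y}_i^t),
\end{equation}
where \( N_p \) is the number of retained pseudo-labels, \( y_i^t \) denotes the student's class prediction for the \( i \)-th pseudo-label, and \( \mathcal{L}_{\text{CE}} \) is the cross-entropy loss.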


\noindent\textbf{Curriculum-Guided Supervision Balancing}: 
In cross-domain settings, pseudo-labels are inherently less reliable than labeled data due to domain shift, and directly applying them with fixed weighting can lead to confirmation bias and unstable optimization. This issue is particularly pronounced in 3D medical detection, where small anatomical variations can significantly affect prediction confidence.

To address this, we design a curriculum mechanism that dynamically regulates the contribution of pseudo-label supervision based on the model’s confidence on labeled data. Instead of treating pseudo-labels as equally reliable throughout training, the model gradually increases their influence only when it demonstrates sufficient confidence on labeled samples. Formally, the pseudo-label weight is updated as:
\vspace{-1em}
\begin{equation}
\lambda_{\text{pseudo}}(t) =
\begin{cases} 
\min(\lambda_{\text{pseudo}}(t-1) + \Delta, \lambda_{\text{max}}), & \text{if } \mathcal{L}_{\text{cls}} < \delta \\  
\max(\lambda_{\text{pseudo}}(t-1) - \Delta, \lambda_{\text{min}}), & \text{if } \mathcal{L}_{\text{cls}} \geq \delta.
\end{cases}
\end{equation}

\noindent If the student classification loss on labeled data falls below a threshold \( \delta \), the pseudo-label weight is increased by \( \Delta \), up to \( \lambda_{\text{max}} \). Otherwise, it is decreased but bounded by \( \lambda_{\text{min}} \). 
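
As a worked example, assume \( \Delta = 0.1 \), \( \delta = 0.01 \), \( \lambda_{\text{min}} = 0 \), \( \lambda_{\text{max}} = 2 \), an initial weight \( \lambda_{\text{pseudo}}(0) = 0 \), and a hypothetical sequence of classification losses (the initial weight and the loss values are assumptions made for illustration only):
\[
\begin{array}{c|ccccc}
t & 1 & 2 & 3 & 4 & 5 \\ \hline
\mathcal{L}_{\text{cls}} & 0.050 & 0.008 & 0.006 & 0.020 & 0.004 \\
\lambda_{\text{pseudo}}(t) & 0 & 0.1 & 0.2 & 0.1 & 0.2
\end{array}
\]
The weight stays at \( \lambda_{\text{min}} \) while the loss is above \( \delta \), grows while the loss is below \( \delta \), and retreats at \( t = 4 \) when the loss rises again. With these values, reaching \( \lambda_{\text{max}} \) from zero requires at least \( \lambda_{\text{max}} / \Delta = 20 \) updates with \( \mathcal{L}_{\text{cls}} < \delta \).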

This design introduces a feedback-driven curriculum that adapts to the learning state of the model, ensuring that pseudo-label supervision is introduced progressively and remains bounded. As a result, the model avoids over-reliance on noisy pseudo-labels in early stages while fully exploiting unlabeled data once reliable representations are learned. This mechanism is particularly effective in scenarios with limited labeled target data, where balancing supervised and pseudo-supervised signals is critical for stable adaptation.


\noindent\textbf{Domain-Robust Representation Learning}:   
To address both domain discrepancy and the risk of forgetting source knowledge, we combine adversarial alignment with a replay strategy. A domain discriminator is attached to backbone features and trained to distinguish between source and target domains, while the student model learns domain-invariant representations through a gradient reversal layer (GRL)~\cite{zhang2020generalizing}. 
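
The gradient reversal layer can be summarized as follows. This is the standard formulation; the binary cross-entropy form of \( \mathcal{L}_{\text{domain}} \) and the symbols \( R \), \( F \), \( D \), and \( d \) are assumptions made for this sketch rather than details specified above. The layer \( R \) acts as the identity in the forward pass and negates the gradient in the backward pass:
\[
R(f) = f, \qquad \frac{\partial R}{\partial f} = -I .
\]
With backbone features \( f = F(X) \), discriminator output \( D(R(f)) \in (0,1) \), and domain label \( d \in \{0, 1\} \) (source or target), a typical domain loss is
\[
\mathcal{L}_{\text{domain}} = -\, d \log D(R(f)) - (1 - d) \log\bigl(1 - D(R(f))\bigr).
\]
Minimizing this loss trains the discriminator to separate the domains, while the reversed gradient pushes the backbone toward features on which the discriminator fails.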

At the same time, we incorporate a replay mechanism inspired by continual learning~\cite{continualreplay}. Instead of randomly sampling source data, we identify and replay the hardest labeled source samples based on detection confidence. These informative samples are jointly trained with target data, reinforcing discriminative features and stabilizing adaptation. The combination of adversarial alignment and targeted replay enables more robust feature learning under domain shift.

\noindent\textbf{Teacher–Student Optimization}:  
The teacher–student framework is optimized jointly using supervised, pseudo-label, and domain alignment objectives. The teacher model is maintained as an Exponential Moving Average (EMA) of the student model parameters:
\vspace{-0.5em}
\begin{equation}
\theta_t \leftarrow \alpha \theta_t + (1 - \alpha) \theta_s,
\end{equation}
where \( \theta_t \) and \( \theta_s \) denote the teacher and student parameters, respectively, and \( \alpha \) is the EMA decay factor. This design stabilizes pseudo-label generation and provides a consistent training signal.
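
To make the effect of the decay factor concrete, the update can be unrolled over \( n \) steps; the superscript step index is notation introduced only for this sketch:
\[
\theta_t^{(n)} = \alpha^{n} \theta_t^{(0)} + (1 - \alpha) \sum_{k=0}^{n-1} \alpha^{k} \theta_s^{(n-k)} .
\]
The student weights from \( k \) updates ago thus enter with weight \( (1-\alpha)\alpha^{k} \). Assuming \( \alpha = 0.9996 \), the effective averaging window is about \( 1/(1-\alpha) = 2500 \) updates, and a student snapshot loses half of its influence after roughly \( \ln 2 / (-\ln \alpha) \approx 1730 \) updates.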

The overall training objective for the student integrates supervised learning on labeled data, pseudo-label learning on unlabeled data, and domain alignment:
\vspace{-0.5em}
\begin{equation}
\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{sup}} + \lambda_{\text{pseudo}}(t) \mathcal{L}_{\text{pseudo}} + \lambda_{\text{domain}} \mathcal{L}_{\text{domain}}.
\end{equation}

\noindent The supervised loss is computed on both labeled source and labeled target data:
\vspace{-0.5em}
\begin{equation}
\mathcal{L}_{\text{sup}} = \mathcal{L}_{\text{sup}}^{\text{src}} + \mathcal{L}_{\text{sup}}^{\text{tgt}},
\end{equation}

\noindent where each term consists of classification, localization, and segmentation components:
\vspace{-0.5em}
\begin{equation}
\mathcal{L}_{\text{sup}}^{(\cdot)} = \mathcal{L}_{\text{cls}}^{(\cdot)} + \mathcal{L}_{\text{bbox}}^{(\cdot)} + \mathcal{L}_{\text{giou}}^{(\cdot)} + \mathcal{L}_{\text{seg}}^{(\cdot)}.
\end{equation}

\noindent Here, \( \mathcal{L}_{\text{cls}}^{(\cdot)} \) denotes classification loss, \( \mathcal{L}_{\text{bbox}}^{(\cdot)} \) is L1 bounding box regression loss, \( \mathcal{L}_{\text{giou}}^{(\cdot)} \) is Generalized IoU loss, and \( \mathcal{L}_{\text{seg}}^{(\cdot)} \) is an optional segmentation loss composed of cross-entropy and Dice losses. Segmentation supervision is applied only to labeled data, following the original Organ-DETR design, and serves as an auxiliary signal to enhance multi-scale feature learning. It is not used for unlabeled data, pseudo-label generation, or inference.

\begin{equation}
\mathcal{L}_{\text{seg}}^{(\cdot)} = \mathcal{L}_{\text{ce}}^{(\cdot)} + \mathcal{L}_{\text{dice}}^{(\cdot)}.
\end{equation}

\noindent The curriculum-controlled weight \( \lambda_{\text{pseudo}}(t) \) regulates pseudo-label influence based on model confidence, while \( \lambda_{\text{domain}} \) scales the domain alignment objective. Together, these components enable stable and effective integration of labeled and unlabeled data for cross-domain 3D organ detection.
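
For concreteness, the objective can be expanded with explicit scalar weights. The sketch below assumes that the loss weights enter as simple multipliers and uses the values cls = 2, bbox = 5, giou = 2, segce = 2, segdice = 2, and \( \lambda_{\text{domain}} = 0.2 \) from our training configuration:
\[
\mathcal{L}_{\text{sup}}^{(\cdot)} = 2\,\mathcal{L}_{\text{cls}}^{(\cdot)} + 5\,\mathcal{L}_{\text{bbox}}^{(\cdot)} + 2\,\mathcal{L}_{\text{giou}}^{(\cdot)} + 2\,\mathcal{L}_{\text{ce}}^{(\cdot)} + 2\,\mathcal{L}_{\text{dice}}^{(\cdot)},
\]
\[
\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{sup}}^{\text{src}} + \mathcal{L}_{\text{sup}}^{\text{tgt}} + \lambda_{\text{pseudo}}(t)\,\mathcal{L}_{\text{pseudo}} + 0.2\,\mathcal{L}_{\text{domain}}, \qquad \lambda_{\text{pseudo}}(t) \in [0, 2].
\]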


\section{Experiments}
% \subsection{Datasets}

\textbf{Datasets}. We evaluate our method on two cross-domain 3D organ detection settings. 
The first setting, AbdomenAtlas $\rightarrow$ TotalSegmentator, uses AbdomenAtlas, a large-scale, multi-center dataset with annotations for multiple abdominal organs~\cite{qu2023abdomenatlas}, as the source domain. 
We use AbdomenAtlas 1.0 with 5,195 scans, of which 3,524 scans form the training set used to pre-train our model. 
To the best of our knowledge, this is the first study utilizing AbdomenAtlas for cross-domain, semi-supervised organ detection. 
The scans in this dataset cover both healthy organs and organs affected by pathologies such as tumors and fatty liver. 
Axis-aligned bounding boxes are extracted from segmentation maps and used as detection labels. 
Scan intensities are rescaled so that the 0.5 and 99.5 percentiles of non-background voxels map to 0 and 1, and are then clipped to the [0, 1] range. 
Augmentations (applied with 50\% probability) include random intensity scaling/shifting (up to 10\%), rotation ($\pm$5$^\circ$), translation (up to 10\%), and zooming ($\pm$10\%). 
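
The percentile normalization above can be written explicitly. Here \( q_{0.5} \) and \( q_{99.5} \) denote the 0.5 and 99.5 intensity percentiles of the non-background voxels; the symbols and the numerical values below are introduced only for illustration:
\[
x' = \min\!\left(1, \max\!\left(0, \frac{x - q_{0.5}}{q_{99.5} - q_{0.5}}\right)\right).
\]
For example, with hypothetical percentiles \( q_{0.5} = -160 \) HU and \( q_{99.5} = 240 \) HU, a voxel at 40 HU maps to \( (40 + 160)/400 = 0.5 \), while a voxel at 1000 HU is clipped to 1.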

In the target dataset, TotalSegmentator~\cite{Wasserthal2022TotalSegmentator}, the training set contains 113 scans, which are split into 8 labeled target scans and 105 unlabeled target scans for semi-supervised training.
In addition, 21 scans are used for validation and 29 scans for testing, which are not included in the training set.
Eight common organs from these two datasets are selected for our detection task.
The TotalSegmentator dataset also includes both healthy and pathological cases.

The second setting, AbdomenCT-1K $\rightarrow$ WORD, involves AbdomenCT-1K, a dataset of 1,112 high-resolution 3D CT scans from five sources, covering the liver, left kidney, right kidney, spleen, and pancreas~\cite{Ma2021AbdomenCT1K}. 
These scans exhibit variability in slice thickness and pixel spacing, making them suitable for cross-domain adaptation. 
We use the 732 scans of its training set to pre-train the model. 
The scans contain both healthy and diseased organs, including cancer and tumors. 
The same normalization and augmentation techniques as above are applied to improve robustness.

The target dataset, WORD, contains 150 CT scans acquired from a single medical center with high-resolution imaging and multiple organ annotations~\cite{Miao2021WORD}. 
We use 31 labeled target scans and 75 unlabeled target scans for training, with 14 validation scans and 29 test scans.
Five common organs from these two datasets are selected for our detection task.
As with the other datasets, bounding boxes are derived from segmentation masks.
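
A minimal sketch of this derivation, with notation introduced here for illustration: let \( \Omega_c \) be the set of voxel coordinates \( (x, y, z) \) labeled as organ \( c \) in the segmentation mask. The axis-aligned box is then
\[
b_c = \Bigl( \min_{\Omega_c} x,\; \min_{\Omega_c} y,\; \min_{\Omega_c} z,\; \max_{\Omega_c} x,\; \max_{\Omega_c} y,\; \max_{\Omega_c} z \Bigr).
\]
For instance, a hypothetical mask occupying voxels with \( x \in [40, 90] \), \( y \in [55, 120] \), and \( z \in [10, 35] \) yields \( b_c = (40, 55, 10, 90, 120, 35) \). Whether the box is stored in corner or center-size form is an implementation detail not fixed by this sketch.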

An overview of dataset splits and organ size definitions is provided in Table~\ref{tab:dataset_split_size}.


\begin{table*}[t]
    \centering
    \caption{Dataset splits and organ size definitions for the two cross-domain settings.
    ``Labeled'' and ``Unlabeled'' refer to training data only.}
    \label{tab:dataset_split_size}
    \begin{tabular}{lccccp{4.8cm}}
    \toprule
    Dataset 
    & Labeled Train 
    & Unlabeled Train 
    & Val 
    & Test 
    & Organ size definitions \\ 
    \midrule
    TotalSeg 
    & 8 
    & 105 
    & 21 
    & 29 
    & \textbf{Small}: gallbladder, pancreas \newline
      \textbf{Medium}: left kidney, right kidney, spleen, aorta \newline
      \textbf{Large}: stomach, liver \\
    \midrule
    WORD 
    & 31 
    & 75 
    & 14 
    & 29 
    & \textbf{Small}: pancreas \newline
      \textbf{Medium}: left kidney, right kidney, spleen \newline
      \textbf{Large}: liver \\
    \bottomrule
    \end{tabular}
\end{table*}





% \subsection{Training and Evaluation Setup}



\noindent\textbf{Training and Evaluation Setup}.  
Training uses AdamW (weight decay $1\times10^{-4}$) with an initial learning rate of $2\times10^{-4}$, decayed by a factor of 0.1 every 500 epochs, for a total of 2,500 epochs. 
Each iteration processes one labeled source, one labeled target, and one unlabeled target sample. 
The supervised loss includes classification, bounding-box regression, and segmentation terms, with weights cls = 2, bbox = 5, giou = 2, segce = 2, and segdice = 2. 
Pseudo-labels are filtered with a confidence threshold of $\tau = 0.8$ and refined via NMS (IoU threshold 0.5). 
The teacher model is updated using EMA with decay $\alpha = 0.9996$.
The pseudo-label weight $\lambda_{\text{pseudo}}(t)$ is adjusted in steps of $\Delta = 0.1$ based on a classification-loss threshold of $\delta = 0.01$, and is clipped to $[\lambda_{\text{min}}, \lambda_{\text{max}}] = [0, 2]$.
The domain alignment loss is weighted by $\lambda_{\text{domain}} = 0.2$. 
These parameter choices follow general design principles for semi-supervised learning in 3D medical CT detection, where pseudo-label supervision should be bounded relative to labeled supervision and should not dominate it. 
We validate this design across two substantially different cross-domain settings: two large-scale source datasets used for pre-training (AbdomenAtlas and AbdomenCT-1K), and two target domains with markedly different levels of labeled supervision (31 labeled scans in WORD versus only 8 labeled scans in TotalSegmentator). 
The consistent effectiveness across these settings suggests that the proposed pseudo-label confidence threshold and curriculum strategy are robust and generally applicable, rather than sensitive to precise parameter values. 
A replay strategy selects hard source samples, i.e., those with the lowest detection performance (measured by mAP) under the source-trained model, and replays as many of them as there are labeled target samples during training. 
All experiments are conducted on an NVIDIA A100 GPU (80 GB).
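
The step schedule of the learning rate can be made explicit as follows, assuming epochs are counted from zero and the decay is applied at every multiple of 500 (an indexing convention assumed for this sketch):
\[
\eta(e) = 2 \times 10^{-4} \cdot 0.1^{\lfloor e / 500 \rfloor}, \qquad e = 0, \dots, 2499,
\]
so the learning rate takes the values \( 2 \times 10^{-4} \), \( 2 \times 10^{-5} \), \( 2 \times 10^{-6} \), \( 2 \times 10^{-7} \), and \( 2 \times 10^{-8} \) over the five 500-epoch stages.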

To evaluate CDSS-Det, we compare it against multiple baselines, including a baseline model trained solely on labeled target data, a pre-trained variant initialized with a source-trained model, and a fully supervised (Full Sup.) model trained with full target annotations. Additionally, we conduct an ablation study to assess the contribution of different components within CDSS-Det.

Recent cross-domain semi-supervised methods in classification~\cite{yuan2024domain} and segmentation~\cite{cai2024class, basak2023ssda} have shown encouraging results, but they lack publicly available source code and implementation details, making direct comparisons infeasible. Furthermore, methods such as~\cite{basak2023ssda}, which focus on 2D segmentation, are difficult to adapt to 3D object detection due to differences in task formulation and model design. As an alternative, we include PixPro~\cite{xie2021pixpro} as an additional consistency constraint, implemented as a separate loss to encourage feature consistency. We empirically determine its optimal coefficient to be 0.01 in our setting. Note that all reported results are obtained using the student model during inference.


% \subsection{Experimental Results}


\begin{table*}[t]
    \vspace{-2em}
    \centering
    % \setlength{\tabcolsep}{4pt}
    \caption{Detection performance on the WORD and TotalSegmentator datasets under different training strategies. Baseline is trained solely on labeled target data. Pre-trained is initialized with a source-trained model to improve generalization. 
    % CDSS-Det further integrates a replay strategy, domain alignment, and dynamic pseudo-labeling. 
    Full Sup. assumes access to full annotations for all target data. mAP and mAR are reported in total and at IoU thresholds of 75\% and 50\%; mAP is additionally grouped by organ size into small (S), medium (M), and large (L) organs.}
    \begin{tabular}{ccccccccccc}
    \toprule
     \multirow{2}{*}{Dataset}
    & \multirow{2}{*}{Method}
    & \multicolumn{3}{c}{mAP $\uparrow$}
    & \multicolumn{3}{c}{mAR $\uparrow$}
    & \multicolumn{3}{c}{mAP $\uparrow$ by size}
    \\
    \cmidrule(lr){3-5} \cmidrule(lr){6-8} \cmidrule(lr){9-11}
    & &  Total & $75\%$ & $50\%$ & Total & $75\%$ & $50\%$ & S & M & L \\\midrule
    \multirow{4}{*}{WORD} 
         & Baseline & 47.2 & 44.7 & 96.5 & 54.0 & 57.9 & 97.2 & 26.2 & 50.4 & 58.7 \\
        & Pre-trained & 72.9 & 85.6 & 97.5 & 77.7 & 89.0 & \textbf{98.6} & 54.0 & 77.2 & 79.0 \\
        & Full Sup. & 65.4 & 78.1 & \textbf{98.1} & 71.1 & 87.1 & \textbf{98.6} & 40.4 & 71.4 & 72.6 \\
        & CDSS-Det & \textbf{76.6} & \textbf{88.8} & 97.4 & \textbf{79.9} & \textbf{91.7} & 97.9 & \textbf{58.4} & \textbf{80.7} & \textbf{82.7} \\
        \midrule
        
        \multirow{4}{*}{TotalSeg}  
        & Baseline & 13.7 & 2.1 & 50.7 & 21.0 & 8.2 & 62.5 & 6.0 & 13.9 & 21.0 \\
        & Pre-trained & 62.5 & 68.1 & \textbf{94.5} & 68.2 & 75.5 & 95.9 & 32.1 & 76.6 & 64.7 \\
        &  Full Sup. & 52.7 & 62.7 & 88.6 & 59.2 & 69.9 & 91.8 & 20.5 & 64.4 & 61.6 \\
        & CDSS-Det & \textbf{70.1} & \textbf{82.2} & 94.1 & \textbf{75.2} & \textbf{85.8} & \textbf{96.3} & \textbf{42.2} & \textbf{81.7} & \textbf{74.5} \\
    \bottomrule
    \end{tabular}
    \label{tab:comparative_results}
    \vspace{-1em}
\end{table*}




\noindent\textbf{Experimental Results}. Table~\ref{tab:comparative_results} summarizes the detection performance of different training strategies on the WORD and TotalSegmentator datasets. CDSS-Det achieves the highest total mAP and mAR on both datasets, as well as the highest mAP at the 75\% IoU threshold and for every organ-size group, while remaining within 0.7 points of the best method at the 50\% IoU threshold. This demonstrates its effectiveness in leveraging labeled source, labeled target, and unlabeled target data for cross-domain semi-supervised organ detection. 
The baseline model, trained only on labeled target data, achieves the lowest performance on both datasets, highlighting the challenges of learning from limited labeled data in the target domain. Pre-training on the source dataset improves performance substantially (total mAP rises from 47.2 to 72.9 on WORD and from 13.7 to 62.5 on TotalSegmentator), confirming the importance of transferring knowledge from a larger labeled dataset. Remarkably, the Full Sup. model, despite full supervision, falls short of CDSS-Det (65.4 vs.\ 76.6 total mAP on WORD and 52.7 vs.\ 70.1 on TotalSegmentator), suggesting that refined pseudo-labeling can effectively supplement sparse annotations and improve generalization.
%Despite having full supervision, the Full Sup. model performs worse than CDSS-Det, suggesting that pseudo-labeling, when properly refined, can effectively supplement limited annotations and improve generalization.
Figure~\ref{fig:vis_results} presents a qualitative comparison of organ detection results between Full Sup. and CDSS-Det. 
%This is further illustrated in Figure~\ref{fig:vis_results}.


In clinical practice, 3D organ detection is often used as a localization or initialization step, such as region-of-interest cropping or as a precursor to downstream segmentation, where an IoU threshold around 0.5 is typically sufficient. 
At the same time, higher IoU thresholds (e.g., 0.75) are important for robust automation, reducing downstream correction effort, and accurately localizing small or anatomically variable organs. 
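As shown in Table~\ref{tab:comparative_results}, CDSS-Det improves mAP over the baseline at both IoU thresholds. Relative to the pre-trained model, the gain is concentrated at 75\% IoU (88.8 vs.\ 85.6 on WORD and 82.2 vs.\ 68.1 on TotalSegmentator), while mAP at 50\% IoU is essentially unchanged (97.4 vs.\ 97.5 and 94.1 vs.\ 94.5). 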
These improvements are particularly pronounced for small and medium organs, which are most sensitive to localization errors under domain shift, while performance on large organs is preserved.
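
The sensitivity of stricter thresholds to small localization errors can be illustrated with a simple worked case, offered as an idealized sketch rather than a description of the evaluation code. Consider two axis-aligned boxes of identical size $w \times h \times d$ whose centers differ by an offset $\delta \in [0, w]$ along a single axis (the symbols $w$, $h$, $d$, and $\delta$ are introduced here for illustration only):
\begin{align*}
    \mathrm{IoU}(\delta) &= \frac{(w-\delta)\,h\,d}{2\,w\,h\,d - (w-\delta)\,h\,d} = \frac{w-\delta}{w+\delta}, \\
    \mathrm{IoU}(\delta) \ge 0.5 &\iff \delta \le \tfrac{w}{3}, \qquad
    \mathrm{IoU}(\delta) \ge 0.75 \iff \delta \le \tfrac{w}{7}.
\end{align*}
For a fixed absolute offset $\delta$, a smaller extent $w$ therefore yields a lower IoU, which is consistent with small organs being the most affected by localization errors at the 75\% threshold.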


CDSS-Det provides notable gains, particularly for small organs, which are traditionally difficult to detect due to their anatomical variability and limited representation in training data. As shown in Figure~\ref{fig:ablations}(a), CDSS-Det substantially outperforms the Full Sup. model across all organ sizes, with the largest improvements observed on small organs on both the WORD and TotalSegmentator datasets. These results suggest that pseudo-label refinement and dynamic weighting are especially effective in addressing the challenges of detecting underrepresented structures. 
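
For reference, the per-size mAP gains of CDSS-Det over Full Sup. can be recomputed from the rounded entries of Table~\ref{tab:comparative_results}. The differences below are plain subtractions of those entries and may deviate slightly from values derived from unrounded results:
\begin{equation*}
    \begin{array}{lccc}
        & \Delta_{S} & \Delta_{M} & \Delta_{L} \\
        \text{WORD} & 58.4 - 40.4 = 18.0 & 80.7 - 71.4 = 9.3 & 82.7 - 72.6 = 10.1 \\
        \text{TotalSeg} & 42.2 - 20.5 = 21.7 & 81.7 - 64.4 = 17.3 & 74.5 - 61.6 = 12.9
    \end{array}
\end{equation*}
On both datasets the gain is largest for small organs.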

We note that the relative clinical importance of large, medium, and small organs can vary across applications. Importantly, CDSS-Det does not trade off large-organ performance to achieve gains on smaller organs: performance on large organs remains comparable to or better than competing methods across both datasets. This indicates improved robustness rather than a bias toward specific organ sizes. Moreover, the proposed framework is flexible, as loss weighting or evaluation emphasis can be adjusted to align with task-specific clinical priorities when required.




% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.6\linewidth]{fig/organ_size.png}
%     \caption{mAP gain of CDSS-Det over the Full Supervised model across different organ sizes (Small, Medium, Large) on the WORD and TotalSegmentator datasets. CDSS-Det achieves substantial improvements on small organs, highlighting its effectiveness in handling anatomically variable and low-frequency classes.}
%     \label{fig:organ_size}
%     \vspace{-1cm}
% \end{figure}



Recent works in medical domain adaptation and semi-supervised learning, such as Yuan et al.~\cite{yuan2024domain}, Cai et al.~\cite{cai2024class}, and Basak et al.~\cite{basak2023ssda}, have explored related ideas in the context of classification and segmentation. However, these methods focus on image-level classification~\cite{yuan2024domain} or pixel-wise segmentation~\cite{cai2024class, basak2023ssda}, and do not address the instance-level challenges inherent in 3D object detection. In addition, they lack open-source implementations and sufficient details for reproducibility, and methods designed for 2D segmentation tasks are not straightforward to adapt to volumetric 3D detection problems. As a result, we include PixPro~\cite{pixpro} as a representative self-supervised learning baseline in our ablation study to evaluate the potential of pixel-level consistency in 3D detection tasks.


\begin{figure}[t]
    \centering
    \vspace{-2em}
    \includegraphics[width=0.7\linewidth]{fig/vis.pdf}
    \caption{Visualization of organ detection results on the WORD dataset. Left: results from Full Sup. Right: results from CDSS-Det. Ground-truth bounding boxes are shown in green and predicted bounding boxes in yellow.}
    \label{fig:vis_results}
    \vspace{-2em}
\end{figure}

\begin{figure}[t]
    \centering
    \begin{minipage}[t]{0.3\linewidth}
        \centering
        \includegraphics[width=\linewidth]{fig/organ_size.png}
        
        {\small (a) }
        \label{fig:organ_size}
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.33\linewidth}
        \centering
        \includegraphics[width=\linewidth]{fig/pseudo_coef.png}
        % {\small (b) Effect of different fixed pseudo-label classification loss coefficients compared to a dynamic weighting strategy.}
        {\small (b)}
        \label{fig:pseudo_coef}
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.3\linewidth}
        \centering
        \includegraphics[width=\linewidth]{fig/weak_strong_aug.png}
        {\small (c)}
        % {\small (c) Comparison of mAP with and without weak-strong augmentation applying to CDSS-Det.}
        \label{fig:weak_strong}
    \end{minipage}
    \caption{(a) mAP gain of CDSS-Det over the Full Sup. model across organ sizes (small, medium, large) on the WORD and TotalSegmentator datasets. (b) Ablation of fixed pseudo-label loss coefficients compared with the dynamic weighting strategy. (c) Ablation of weak-strong augmentation applied to CDSS-Det.}
    \label{fig:ablations}
    \vspace{-2em}
\end{figure}



% \begin{figure}[t]
%     \centering
%     \includegraphics[width=\linewidth]{fig/organ_size.png}
%     \caption{mAP gains from the dynamic pseudo labeling strategy for small, medium, and large organs.}
%     \label{fig:organ_size}
% \end{figure}

Table~\ref{tab:ablation_results} reports the ablation study, which evaluates the contribution of individual components within CDSS-Det. The replay strategy improves feature stability by retaining harder samples from the source domain. Domain adaptation provides an additional performance boost by reducing feature discrepancies between source and target distributions, but its impact remains small compared to that of pseudo-labeling. Self-supervised feature consistency with PixPro does not yield substantial improvements, suggesting that pixel-level contrastive learning may be less effective for volumetric medical imaging. Overall, the largest gains come from pseudo-labeling and dynamic weighting, which allow the model to incorporate pseudo-labels gradually without introducing excessive noise. Note that CDSS-Det corresponds to the last row of each dataset block in the table. 
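
The incremental effect of each component on total mAP can be read off Table~\ref{tab:ablation_results} by subtracting the row without the component from the row with it. In the summary below, D and P are measured against the preceding replay-only and R+D rows, PL against R+D, and Dyn against R+D+PL; all values are differences of the rounded table entries:
\begin{equation*}
    \begin{array}{lcccc}
        & +\text{D} & +\text{P} & +\text{PL} & +\text{Dyn} \\
        \text{WORD} & 74.7 - 74.5 = 0.2 & 74.8 - 74.7 = 0.1 & 75.6 - 74.7 = 0.9 & 76.6 - 75.6 = 1.0 \\
        \text{TotalSeg} & 68.0 - 67.4 = 0.6 & 68.1 - 68.0 = 0.1 & 68.5 - 68.0 = 0.5 & 70.1 - 68.5 = 1.6
    \end{array}
\end{equation*}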

To investigate the impact of curriculum learning, we compare CDSS-Det with fixed pseudo-labeling loss coefficients against our dynamic curriculum strategy. As shown in Figure~\ref{fig:ablations}(b), fixed coefficients lead to unstable performance: depending on the coefficient, mAP fluctuates between 74.7 and 75.6 on the WORD dataset and between 66.8 and 68.5 on the TotalSegmentator dataset. Notably, large coefficients such as 1.0 and 2.0 degrade performance on both datasets due to training divergence. In contrast, our curriculum-based dynamic weighting strategy achieves a higher mAP of 76.6 on WORD and 70.1 on TotalSegmentator, demonstrating its ability to balance supervision and to mitigate overfitting and the effect of label noise during training. This indicates that dynamic weighting of the pseudo-label loss is important for stable and effective semi-supervised learning in medical detection scenarios.

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=\linewidth]{fig/pseudo_coef.png}
%     \caption{Effect of different fixed pseudo-label classification loss coefficients compared to a dynamic weighting strategy.}
%     \label{fig:pseudo_coef}
% \end{figure}

To investigate the effect of weak-strong augmentation in the context of medical image detection, we compare the performance of CDSS-Det with and without this strategy. As shown in Figure~\ref{fig:ablations}(c), applying weak-strong augmentation decreases mAP from 76.6 to 75.0 on the WORD dataset and from 70.1 to 68.3 on the TotalSegmentator dataset, contrary to trends observed in natural image domains. For instance, in Adaptive Teacher~\cite{adaptiveteacher}, weak-strong augmentation significantly boosts detection performance across various cross-domain scenarios. In our experiments on 3D CT data, however, this technique appears to hinder performance. One possible explanation is that aggressive augmentation may distort subtle anatomical structures and degrade the quality of pseudo-labels, especially when the model already generates high-quality predictions under weak augmentation alone. This points to a critical difference between medical and natural image domains: in the former, preserving spatial fidelity is often more important than encouraging invariance through strong perturbations.
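
For clarity, the weak-strong scheme discussed above can be summarized with illustrative notation introduced only here: $a_w$ and $a_s$ denote weak and strong augmentation, $f_T$ and $f_S$ the teacher and student networks, and $\ell$ the pseudo-label loss. In the commonly used teacher-student form, the teacher labels the weak view and the student is trained on the strong view,
\begin{equation*}
    \hat{y} = f_T\bigl(a_w(x)\bigr), \qquad \mathcal{L}_{\text{pseudo}} = \ell\bigl(f_S(a_s(x)),\, \hat{y}\bigr),
\end{equation*}
whereas the variant without weak-strong augmentation is assumed in this sketch to feed weakly augmented inputs to both networks, i.e., $a_s$ is replaced by $a_w$.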


\begin{table*}[t]
\vspace{-2em}
\centering
\caption{Ablation study on the WORD and TotalSegmentator datasets. Configuration settings include Replay (R), Domain Alignment (D), PixPro (P), Pseudo-Labeling (PL), and Dynamic Pseudo-Labeling (Dyn).}
\setlength{\tabcolsep}{3.5pt} % Reduce column spacing further
% \renewcommand{\arraystretch}{1.1} % Adjust row height slightly
\begin{tabular}{c ccccc ccc ccc ccc}
    \toprule
    & \multicolumn{5}{c}{Configuration} & \multicolumn{3}{c}{mAP$\uparrow$} & \multicolumn{3}{c}{mAR$\uparrow$} & \multicolumn{3}{c}{mAP $\uparrow$ by size} \\
    \cmidrule(lr){2-6} \cmidrule(lr){7-9} \cmidrule(lr){10-12} \cmidrule(lr){13-15}
     & R & D & P & PL & Dyn & Total & 75\% & 50\% & Total & 75\% & 50\% & S & M & L \\
    \midrule
    \multirow{5}{*}{\rotatebox{90}{WORD}}  
    & \checkmark &  &  &  &  & 74.5 & 87.7 & 96.1 & 79.7 & \textbf{92.4} & 97.9 & 52.4 & 78.7 & \textbf{84.2} \\
    & \checkmark & \checkmark &  &  &  & 74.7 & 86.2 & \textbf{97.5} & 79.6 & 91.0 & \textbf{98.6} & 52.7 & 79.6 & 82.0 \\
    & \checkmark & \checkmark & \checkmark &  &  & 74.8 & 87.7 & 96.4 & 79.5 & 91.7 & 97.9 & 55.3 & 79.5 & 80.2 \\
    & \checkmark & \checkmark &  & \checkmark &  & 75.6 & 88.2 & 97.4 & 79.5 & 91.7 & 97.9 & 56.6 & 80.0 & 81.3 \\
    & \checkmark & \checkmark &  & \checkmark & \checkmark & \textbf{76.6} & \textbf{88.8} & 97.4 & \textbf{79.9} & 91.7 & 97.9 & \textbf{58.4} & \textbf{80.7} & 82.7 \\
    \midrule
    \multirow{5}{*}{\rotatebox{90}{TotalSeg}}  
    & \checkmark &  &  &  &  & 67.4 & 78.8 & 92.7 & 73.2 & 84.3 & 95.1 & 40.3 & 78.5 & 72.3 \\
    & \checkmark & \checkmark &  &  &  & 68.0 & 78.8 & \textbf{94.2} & 73.8 & 83.8 & 95.9 & 39.3 & 79.9 & 73.0 \\
    & \checkmark & \checkmark & \checkmark &  &  & 68.1 & 78.7 & 93.9 & 73.1 & 83.4 & 95.1 & 39.9 & 80.2 & 72.3 \\
    & \checkmark & \checkmark &  & \checkmark &  & 68.5 & 80.1 & \textbf{94.2} & 74.2 & 85.1 & 96.0 & 41.7 & 78.9 & 74.3 \\
    & \checkmark & \checkmark &  & \checkmark & \checkmark & \textbf{70.1} & \textbf{82.2} & 94.1 & \textbf{75.2} & \textbf{85.8} & \textbf{96.3} & \textbf{42.2} & \textbf{81.7} & \textbf{74.5} \\
    \bottomrule
\end{tabular}
\label{tab:ablation_results}
\vspace{-1em}
\end{table*}

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=\linewidth]{fig/weak_strong_aug.png}
%     \caption{Comparison of mAP with and without weak-strong augmentation applying to CDSS-Det.}
%     \label{fig:weak_strong}
% \end{figure}


Overall, our results demonstrate that CDSS-Det outperforms models trained solely on labeled target data, models initialized by source pre-training, and fully supervised models in total mAP and mAR on both datasets. By employing curriculum-guided pseudo-labeling and reliability-aware supervision, CDSS-Det effectively mitigates domain shifts and improves cross-domain generalization.

%Overall, our results underscore that CDSS-Det redefines cross-domain organ detection by effectively merging labeled and unlabeled data. Unlike fully supervised methods, our approach adeptly navigates domain shifts and captures the nuances of small organ detection through refined pseudo-labeling and dynamic weighting. %This demonstrates the promise of semi-supervised learning in overcoming limited annotations and adapting robustly to diverse clinical scenarios.

%CDSS-Det achieves the best overall detection performance on both datasets, surpassing the Full Sup. model. The results suggest that a well-designed semi-supervised learning approach can make efficient use of both labeled and unlabeled data, leading to improved generalization and better adaptation to domain shifts.


\vspace{-1em}
\section{Conclusion}

% We demonstrated that the introduced CDSS-Det effectively leverages labeled and unlabeled data for cross-domain 3D organ detection, surpassing both pre-trained and fully supervised models. Pseudo-label refinement contributes the most to performance gains, while replay and domain adaptation further enhance generalization. 
% These findings underscore the potential of semi-supervised learning not only in reducing annotation efforts but also in improving organ detection robustness.
%The results highlight the potential of semi-supervised learning in reducing annotation needs and improving detection across domains.

We demonstrated that CDSS-Det effectively leverages labeled and unlabeled data for cross-domain 3D organ detection, surpassing both pre-trained and fully supervised models. Pseudo-label refinement contributes the most to performance gains, while replay and domain adaptation further enhance generalization. 
%These findings underscore the potential of semi-supervised learning not only in reducing annotation efforts but also in improving organ detection robustness. Moreover, our approach mitigates domain shifts commonly encountered in medical imaging, making it a promising solution for real-world clinical applications. 
These findings highlight the potential of semi-supervised learning not only to reduce annotation efforts and enhance detection robustness but also to address the domain shifts inherent in medical imaging.
By facilitating reliable domain transfer, our approach takes a step toward translating organ detection methods into clinical practice. Future work will extend this framework to additional imaging modalities and anatomical structures to further validate its effectiveness.

\vspace{-1em}
\section{Acknowledgments}
The authors gratefully acknowledge the computational and data resources provided by the Leibniz Supercomputing Centre (\url{https://www.lrz.de}).

\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}


\bibliography{midl26_10}

% \appendix
% \section{Code and Datasets}
% The code and the datasets are available in:
% \begin{links}
%     \link{Code}{https://github.com/CodeForAAAI2026/CDSS-Det}
%     % \link{Datasets}{https://github.com/CodeForAAAI2026/CDSS-Det}
% \end{links}

\end{document}
