\documentclass[table]{midl}
 % Include author names
%\documentclass[anon]{midl} % Anonymized submission
% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{mwe} % to get dummy images
\usepackage{multirow}
\usepackage{colortbl}
\usepackage{rotating}
\usepackage{makecell}
\usepackage{array}
\usepackage{enumitem}


\usepackage{pifont}
\newcommand{\xmark}{\ding{55}}

\newcommand{\transformer}{Reducing Uncertainty in 3D Medical Image Segmentation under Limited Annotations through Contrastive Learning}
\newcommand\figref{Figure~\ref}
\newcommand\tabref{Table~\ref}
\newcommand{\mathdash}[1]{{\operatorname{#1}}}

\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\jmlrvolume{-- nnn}
\jmlrproceedings{PMLR}{}
\editors{Accepted for publication at MIDL 2024}




\title[Reducing Uncertainty in 3D Medical Image Segmentation]{\transformer}

 
% Three or more authors with the same address:
\midlauthor{\Name{Sanaz Karimijafarbigloo \nametag{$^{1}$}} \Email{ sanaz.karimijafarbigloo@ur.de}\\
\Name{Reza Azad \nametag{$^{2}$}} 
\Email{Azad@pc.rwth-aachen.de}\\
\Name{Amirhossein Kazerouni  \nametag{$^{3}$}} \Email{amirhossein477@gmail.com}\\  
\Name{Dorit Merhof  \nametag{$^{1,4}$}} \Email{dorit.merhof@informatik.uni-regensburg.de}\\
\addr $^{1}$ Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany\\
\addr $^{2}$ RWTH Aachen University, Aachen, Germany \\
\addr $^{3}$ School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran \\
\addr $^{4}$ Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany
}
 

\begin{document}

\maketitle
\vspace{-0.75em}
\begin{abstract}
Despite recent successes in semi-supervised learning for natural image segmentation, 
applying these methods to medical images presents challenges in obtaining discriminative representations from limited annotations. While contrastive learning frameworks excel in similarity measures for classification, their transferability to precise pixel-level segmentation in medical images is hindered, particularly when confronted with inherent prediction uncertainty.
To overcome this issue, our approach incorporates two subnetworks to rectify erroneous predictions. The first network identifies uncertain predictions, generating an uncertainty attention map. The second network employs an uncertainty-aware descriptor to refine the representation of uncertain regions, enhancing the accuracy of predictions. Additionally, to adaptively recalibrate the representation of uncertain candidates, we define class prototypes based on reliable predictions. We then aim to minimize the discrepancy between class prototypes and uncertain predictions through a deep contrastive learning strategy.
Our experimental results on organ segmentation from clinical MRI and CT scans demonstrate the effectiveness of our approach compared to state-of-the-art methods. \href{https://github.com/xmindflow/uncertainty3D}{Code}. 
\end{abstract}

\begin{keywords}
    Uncertainty, Contrastive, Segmentation, Medical Image.
\end{keywords}

\vspace{-0.75em}
\section{Introduction}
\label{sec:intro}
\vspace{-0.5em}
Medical image segmentation plays a pivotal role in the field of medical imaging, serving as a crucial step in the analysis and interpretation of complex visual data. This process involves the partitioning of images into meaningful and clinically relevant regions, allowing for a detailed examination of structures and abnormalities within the human body. One prominent approach to medical image segmentation is supervised learning. This paradigm involves training algorithms on labeled datasets, where each image is accompanied by annotations identifying the regions of interest. Despite its success in many applications, its main drawback lies in its dependency on large and accurate labeled datasets \cite{aljuaid2022survey,azad2023advances}. Creating such datasets for medical images requires considerable expertise and time. Additionally, supervised learning is susceptible to human error caused by manual segmentation and labeling.

In response to these challenges, various strategies have been proposed to alleviate the dependency on meticulous labeling processes. Unsupervised learning operates without labeled data. Algorithms in this category identify inherent patterns and structures within medical images without prior knowledge of specific regions \cite{caron2018deep,chen2020medical,hamilton2022unsupervised,zhao2022uda,feng2023unsupervised,omidi2024unsupervised}. However, its limitations include the difficulty of distinguishing between normal and abnormal structures, hindering applicability in clinical settings that require precise identification. Interpretability of results is also challenging, as decisions rely solely on inherent data patterns. Transfer learning is another powerful strategy for enhancing medical image segmentation, leveraging pre-trained models to improve performance on tasks with limited labeled data \cite{kora2022transfer,araujo2022automatic,alhares2023amtldc}. However, its drawback stems from the assumption of similar distributions between the source and target domains. If this assumption is not met, pre-trained features may not capture the nuances of the target medical imaging data, potentially leading to suboptimal results. Self-supervised learning overcomes label scarcity by generating its own supervisory signals, enhancing model performance \cite{tang2022self,karimijafarbigloo2023self,ouyang2022self,kazerouni2023fusenet}. This approach, beneficial when labeled data is limited, faces the challenge of designing effective surrogate tasks. Ensuring these tasks capture pertinent information is crucial for the success of self-supervised methods. 
Furthermore, these approaches usually require additional supervisory signals derived from annotated data to be specifically directed toward the targeted task.

To address these issues, semi-supervised learning (SSL) offers a promising solution. This approach trains models with a limited number of labeled samples and a large amount of unlabeled data, striking a balance between supervised and unsupervised methods. In medical image segmentation, it proves valuable in scenarios where obtaining a fully labeled dataset is impractical, providing a practical and cost-effective solution for training robust segmentation models \cite{luo2022semi1,wu2022mutual,luo2022semi2,wang2023mcf}. SSL commonly employs two primary methods: pseudo-labeling and consistency regularization. Pseudo-labeling generates ``pseudo labels'' for unlabeled data from a model's predictions. In a teacher-student scenario \cite{tarvainen2017mean, wang2022semi, xu2021end}, the teacher model, maintained as the exponential moving average (EMA) of the student model, generates the pseudo labels. These pseudo labels are then integrated with the original labeled dataset to optimize accuracy and achieve cost-effective training \cite{lee2013pseudo,xie2020self}. However, while pseudo-labeling offers advantages, the scarcity of labeled data raises concerns about the quality of the pseudo labels. Hence, this method risks introducing inaccuracies into the training data, potentially affecting the final model's precision.

To address this issue, current methods adopt confidence score filtering for predictions \cite{zuo2021self,zou2020pseudoseg,zhang2021flexmatch, sohn2020fixmatch}: only predictions with high confidence scores are employed as pseudo-labels, while uncertain ones are disregarded. Nevertheless, this approach does not remove all inaccurate predictions, as some incorrect predictions possess high classification scores, a phenomenon known as over-confidence or miscalibration \cite{guo2017calibration}. Furthermore, setting a high threshold significantly decreases the quantity of generated pseudo-labels, thereby restricting the efficacy of semi-supervised learning. Additionally, relying solely on presumably reliable predictions (which may under-represent certain classes or segments) can lead to an imbalance in the training data and ultimately compromise the model's performance, particularly for challenging and less frequent classes. For instance, when the model has difficulty accurately predicting specific classes, generating accurate pseudo-labels for the corresponding pixels becomes problematic. In this respect, Lu et al.~\cite{lu2023mutually} suggested a technique that combines pseudo-labeling with dual consistency regularization, emphasizing its strong uncertainty awareness capability. This approach incorporates a cycle-loss regularization to enhance the accuracy of uncertainty estimation.
Shen et al.~\cite{shen2023co} introduced the UCMT method for semi-supervised semantic segmentation. This approach consists of two main components: Collaborative Mean-Teacher (CMT) and an uncertainty-guided region mix. The CMT component aims to maintain model disagreement while enhancing the quality of pseudo-labels through collaboration.
Specifically, UCMT generates a new image by replacing uncertain regions with certain ones and then utilizes a collaborative approach to ensure consistent predictions across different networks. However, a limitation of this method is that it does not explicitly modify the representation of related voxels to reduce uncertainty. Consequently, there is a need for a mechanism to re-represent these uncertain voxels with different localities, such as through deformable convolutions, which could significantly enhance the overall effectiveness of the approach.



Acknowledging the reliability concerns linked with pseudo-labeling and the drawbacks of confidence score filtering, our method introduces a novel semi-supervised contrastive learning approach to address these challenges. In this context, we outline the following key contributions:
\textbf{1)} We propose a mechanism to recognize uncertain predictions as a means to refine network representation, aiming for improved overall representation quality.
\textbf{2)} To alleviate prediction uncertainty, we introduce an uncertainty-aware feature descriptor module. This module enhances contextual and semantic representation, contributing to a more robust and accurate prediction.
\textbf{3)} We design a deep contrastive supervision function to minimize discrepancies between class prototypes and uncertain predictions.

\vspace{-1.5em}
\section{Method}
\vspace{-0.25em}
{
Current SSL algorithms that rely on consistency learning, such as Mean-teacher~\cite{tarvainen2017mean} and~\cite{chen2021semi}, propose applying consistency regularization not within a single model but among the pseudo labels within a multi-model architecture. Nonetheless, throughout the training process, there is a tendency for the dual-network SSL framework to quickly reach a consensus, causing the co-training to degrade into self-training  \cite{kendall2017uncertainties}. To address this issue, we introduce a setup consisting of a predictive model accompanied by an auxiliary model. The predictive model serves as a pseudo-label generator, directing the training of the other model.
In our strategy, we work with two distinct datasets: $\mathcal{D}_l = \{(\mathbf{x}_i^l, \mathbf{y}_i^l)\}_{i=1}^{N_l}$, comprising labeled data pairs, and $\mathcal{D}_u = \{\mathbf{x}_i^u\}_{i=1}^{N_u}$, a considerably larger unlabeled dataset, where $N_l$ and $N_u$ denote the number of samples in the labeled and unlabeled datasets, respectively. The goal is to leverage the unlabeled data to enhance feature representation, training our segmentation model by combining a limited set of labeled data with an extensive pool of unlabeled data.
For labeled data, both predictive and auxiliary models undergo optimization through supervised learning:
\begin{equation}
\label{eq:suploss}
\begin{aligned}
\mathcal{L}_s = \frac{1}{|\mathcal{B}_l|} \sum_{(\mathbf{x}_i^l, \mathbf{y}_i^l) \in \mathcal{B}_l} \Big[ & \ell_{ce}({\hat{\mathbf{y}}_i^l}, \mathbf{y}_i^l)
+ \text{Dice}({\hat{\mathbf{y}}_i^l}, \mathbf{y}_i^l) \\
& + \ell_{ce}({\tilde{\mathbf{y}}_i^l}, \mathbf{y}_i^l)
+ \text{Dice}({\tilde{\mathbf{y}}_i^l}, \mathbf{y}_i^l) \Big]
\end{aligned}
\end{equation}
where ${\hat{\mathbf{y}}_i^l}$ and ${\tilde{\mathbf{y}}_i^l}$ denote the predicted segmentation maps of the predictive and auxiliary networks, respectively, and $\mathcal{B}_l$ is the batch of labeled data. When handling unlabeled data, two key factors come into play: 1) direct guidance from the predictive model to the auxiliary model to reduce uncertainty and 2) minimizing uncertain predictions through deep-supervisory contrastive learning. To accomplish the former, we first feed the unlabeled sample into the predictive model to generate the pseudo label ${\hat{\mathbf{y}}_i^u}$. We then use this pseudo label to calculate the prediction loss for the auxiliary model.
To further enhance error correction, we incorporate a regularization loss $\mathcal{L}_{reg}$, which calculates the $L_1$ distance between the XOR predictions \cite{wang2023mcf} of the predictive and the auxiliary models: 
\begin{equation}
\label{eq:unsloss}
\mathcal{L}_u = \frac{1}{|\mathcal{B}_u|} \sum_{\mathbf{x}_i^u \in \mathcal{B}_u} \Big[ \ell_{ce}({\tilde{\mathbf{y}}_i^u}, \hat{\mathbf{y}}_i^u)
+ \text{Dice}({\tilde{\mathbf{y}}_i^u}, \hat{\mathbf{y}}_i^u)+\lambda_{reg}\mathcal{L}_{reg}({\tilde{\mathbf{y}}_i^u}, \hat{\mathbf{y}}_i^u) \Big],
\end{equation}
where $\lambda_{reg}$ is the weight of $\mathcal{L}_{reg}$ and $\mathcal{B}_u$ is the batch of unlabeled data. We additionally restrict the computation of the loss function to voxels with prediction confidence exceeding $0.7$. The overall training loss ($\mathcal{L}$) incorporates three components: the supervised loss ($\mathcal{L}_s$), the unsupervised loss ($\mathcal{L}_u$), and the contrastive loss ($\mathcal{L}_c$). Details of the contrastive loss are presented in \autoref{sec:contrastive}. The optimization objective is to minimize the overall loss, expressed as:
\begin{equation}
\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u + \lambda_c \mathcal{L}_c,
\end{equation}
 The weights $\lambda_u$ and $\lambda_c$ determine the contribution of the unsupervised and contrastive losses, respectively. The overall architecture of the network is depicted in \autoref{fig:uncertainmethod}.}


\begin{figure*}
\centering
    \includegraphics[width=0.9\textwidth]{Figures/method_contrastive.pdf}
    \vspace{-0.5em}
    \caption{The proposed dual-path semi-supervised contrastive learning method, which aims to exploit unlabeled data while minimizing uncertain predictions.
    }\label{fig:uncertainmethod}
    \vspace{-1.5em}
\end{figure*}


\begin{figure*}
\centering
    \includegraphics[width=0.9\textwidth]{Figures/uncertainty_module.pdf}
    \vspace{-0.5em}
    \caption{\textbf{MultiCURE Module:} Adaptive feature recalibration across scales using three branches, overcoming fixed kernel limitations with deformable convolutions, selectively extracting information from uncertain and certain regions, and integrating through channel-wise concatenation with a soft attention mechanism.
    }\label{fig:MedScale-Former}
    \vspace{-1.5em}
\end{figure*}

\vspace{-0.75em}
\subsection{Multi-Branch Contextual Uncertainty Reduction (MultiCURE) Module}
\vspace{-0.25em}
In our design, we argue that inaccuracies can arise from inadequate contextual information in the vicinity of a voxel, which makes precise predictions challenging. To address this issue, we first input the sample into the predictive model to generate the segmentation map. We then create a binary uncertainty map by thresholding the softmax output of the network: voxels with confidence scores below a specified threshold are labeled as uncertain (assigned a value of 1), while voxels with confidence scores above the threshold are labeled as certain (assigned a value of 0). Finally, we enhance the representation of uncertain voxels in the auxiliary network using the MultiCURE module.
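
As a compact restatement of the thresholding step, the binary uncertainty map can be written as follows. The threshold symbol $\gamma$ and the class-probability notation $p_c(v)$ are introduced here purely for illustration and are not part of the notation used elsewhere in this paper:
\begin{equation*}
U(v) =
\begin{cases}
1, & \text{if } \max_{c}\, p_c(v) < \gamma \quad \text{(uncertain)},\\
0, & \text{otherwise} \quad \text{(certain)},
\end{cases}
\end{equation*}
where $p_c(v)$ is the softmax probability of class $c$ at voxel $v$. For instance, under an assumed threshold of $\gamma = 0.7$, a voxel with a two-class softmax output of $(0.55, 0.45)$ would be marked as uncertain, whereas a voxel with output $(0.9, 0.1)$ would be marked as certain.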

MultiCURE, a key component of our architecture, recalibrates context information across multiple scales by adaptively selecting receptive fields from global and local pathways. It overcomes the limitations of fixed kernel sizes in traditional convolutions and reduces uncertainty in critical areas, especially along object boundaries, which inherently have the highest degree of uncertainty. The module comprises three branches. Two of them use fixed kernels of different sizes, which diversifies the receptive field. On their own, however, fixed sampling grids remain misaligned with object boundaries (or uncertain regions), leading to suboptimal results. The third branch therefore provides vital global information, refines boundary delineation, and significantly improves overall performance.

The MultiCURE module employs a three-path split. Two paths apply a convolution, Batch Normalization (BN), and a ReLU activation function with $3\times3$ and $5\times5$ kernel sizes, respectively. The third path performs a $3\times3$ deformable convolution \cite{dai2017deformable} on the input $z \in \mathbb{R}^{H \times W \times d}$. The deformable convolution in our design dynamically adjusts the receptive field for each feature map location by using an offset field to flexibly warp the sampling grid. This allows the model to handle varied object sizes in an image and gain a superior understanding of object regions.
To further enhance feature representation within uncertain regions, we integrate (i.e., sum) the uncertainty attention map with the output of the deformable convolution. This integration conditions the representation on the uncertain regions, serving as a mechanism to accentuate the representation of these areas using non-fixed resampling points. Consequently, this approach allows us to allocate increased attention to uncertain regions.



To integrate information from all branches, a fusion step performs channel-wise concatenation of the branch outputs $u_1$, $u_2$, and $u_3$: $q = [u_1 \,\|\, u_2 \,\|\, u_3] \in \mathbb{R}^{H \times W \times 3d}$. Global average pooling (GAP) is then applied to $q$ to condense the spatial information of the entire feature map into a vector $q' \in \mathbb{R}^{3d}$. This vector undergoes a linear transformation $\mathcal{F}: q' \rightarrow k$, resulting in $k \in \mathbb{R}^{3d/r}$, where $r$ is the reduction ratio. This transformation reduces dimensionality and thereby improves computational efficiency. Subsequently, each path applies an independent linear layer that maps $k$ from dimension $3d/r$ back to the input channel dimension $d$.

Concatenating the outputs of all paths yields \( K \in \mathbb{R}^{3 \times d} \), to which a soft attention mechanism, specifically the softmax function, is applied across channels. This adaptive approach selectively emphasizes the most pertinent feature scales. The resultant feature map \( z' \in \mathbb{R}^{H \times W \times d} \) is obtained by applying the attention weights $K_1$, $K_2$, and $K_3$ to the outputs of the individual branches:
\begin{equation}
z' = u_1\cdot K_{1} + u_2\cdot K_{2} + u_3\cdot K_{3}.
\end{equation}
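
The fusion steps can be summarized in one chain of operations. The following is a sketch under one reading of the description: the symbols $W_j$ and $a_j$ are illustrative names that do not appear elsewhere in this paper, and we assume that the softmax normalizes over the three branches separately for each channel:
\begin{align*}
q' &= \operatorname{GAP}(q) \in \mathbb{R}^{3d}, & k &= \mathcal{F}(q') \in \mathbb{R}^{3d/r},\\
a_j &= W_j\, k \in \mathbb{R}^{d}, & K_j &= \frac{\exp(a_j)}{\sum_{m=1}^{3} \exp(a_m)}, \quad j \in \{1, 2, 3\},
\end{align*}
where $\exp$ and the division act element-wise. Under this assumption, $K_1 + K_2 + K_3$ equals the all-ones vector, so each channel of $z'$ is a convex combination of the corresponding channels of $u_1$, $u_2$, and $u_3$.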

\vspace{-0.5em}
\subsection{Deep Contrastive Learning}
\label{sec:contrastive}
\vspace{-0.25em}
In our methodology, we propose deep contrastive supervision to refine the model's representation and enhance its discriminative capabilities. Initially, we identify high-confidence and uncertain predictions within the segmentation output. Leveraging the high-confidence predictions, we define class prototypes by extracting representations from multiple network levels to capture both shallow and deep representations. To compute each prototype, we calculate the mean vector $\mathbf{c}_k$ of the reliable voxel representations of class $k$:
\begin{equation}
\mathbf{c}_k = \frac{1}{\left|S_k\right|} \sum_{(\mathbf{v^r}_i, y_i) \in S_k} f_{l:L}(\mathbf{v^r}_i),
\end{equation}
where \( f_{l:L}(\mathbf{v^r}_i) \) represents the feature representation of the voxel corresponding to the reliable prediction \( (\mathbf{v^r}_i, y_i) \) from different levels of the network and $S_k$ is the set of certain predictions for class $k$.
Subsequently, for the set of uncertain predictions \( f(\mathbf{v^u}_i) \), we resample candidates to align them with the corresponding class prototype, employing a contrastive learning algorithm for this purpose. By applying contrastive learning to feature sets extracted from various network blocks, we provide a deep supervisory signal for the network to contextually recalibrate the representation of uncertain voxels, aligning them with the class prototype. The contrastive loss is computed by aggregating their representations, and our objective is to minimize this loss. This approach leverages multi-level representations for deep supervision, enabling the refinement and improvement of the network's predictions to generate more discriminative features.


\begin{equation}
\mathcal{L}_{c_k} = -\frac{1}{\left|S_k^u\right|} \sum_{(\mathbf{v^u}_i, y_i) \in S_k^u} \log \frac{\exp\left(\frac{\mathrm{sim}(\mathbf{v^u}_i, \mathbf{c}_k)}{\tau}\right)}{\sum_{j \neq k} \exp\left(\frac{\mathrm{sim}(\mathbf{v^u}_i, \mathbf{c}_j)}{\tau}\right)},
\end{equation}
where $\mathcal{L}_{c_k}$ is the contrastive loss for class $k$,
$S_k^u$ is the set of uncertain predictions for class $k$,
$\mathrm{sim}(\mathbf{v^u}_i, \mathbf{c}_k)$ is the cosine similarity between an uncertain voxel representation $\mathbf{v^u}_i$ and the class prototype $\mathbf{c}_k$, and
$\tau$ is a temperature parameter controlling the sharpness of the contrastive loss function. We augment the contrastive loss with an additional term that considers the distance between class prototypes. This extra term encourages the representation space to actively separate the clusters of different classes.
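
To make the role of the temperature concrete, consider a hypothetical two-class case. All numbers here are assumed for illustration and are not taken from our experiments: $\mathrm{sim}(\mathbf{v^u}_i, \mathbf{c}_k) = 0.8$, $\mathrm{sim}(\mathbf{v^u}_i, \mathbf{c}_j) = 0.2$ for the single other class $j$, and $\tau = 0.1$. The logarithmic term for this voxel then evaluates to
\begin{equation*}
\log \frac{\exp(0.8/0.1)}{\exp(0.2/0.1)} = \frac{0.8 - 0.2}{0.1} = 6,
\end{equation*}
so the voxel contributes $-6$ to the loss before averaging. Halving the similarity gap to $0.3$ gives a contribution of $-3$, and doubling the temperature to $\tau = 0.2$ at the original gap also gives $-3$. In this simplified setting, a smaller $\tau$ therefore magnifies the effect of a given difference in similarity between the correct prototype and the others.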

\vspace{-0.5em}
\section{Experiments}
\vspace{-0.25em}

We implemented a two-stream pipeline in PyTorch and trained it on an RTX A5000 GPU. Following \cite{wang2023mcf}, V-Net and ResNet were chosen for the predictive and auxiliary models, respectively. Training ran for 6000 iterations with a batch size of 2, sampling randomly from the labeled and unlabeled sets. We used the SGD optimizer with a weight decay of 0.0001, a momentum of 0.9, and an initial learning rate of 0.01, which is divided by 10 every 2500 iterations. We set $\lambda_u = 1.0$ and use the dynamic weighting $\lambda_c = 0.1 \cdot e^{4(1-t/t_{max})^2}$, where $t$ and $t_{max}$ denote the current and maximum iteration, respectively. For evaluation, we follow the setting of \cite{wang2023mcf} and use 5-fold and 4-fold cross-validation on the LA and Pancreas datasets, respectively.
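
For reference, the stated weighting schedule can be evaluated at a few iterations. This is a worked illustration of the formula exactly as written above, with $t_{max} = 6000$ and values rounded to two decimals:
\begin{equation*}
\lambda_c(0) = 0.1\, e^{4} \approx 5.46, \qquad
\lambda_c(t_{max}/2) = 0.1\, e^{1} \approx 0.27, \qquad
\lambda_c(t_{max}) = 0.1\, e^{0} = 0.1 .
\end{equation*}
Similarly, assuming the step decay is applied at iterations 2500 and 5000, the learning rate is $10^{-2}$ for $t < 2500$, $10^{-3}$ for $2500 \le t < 5000$, and $10^{-4}$ for the remaining iterations.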
\vspace{-0.5em}
\subsection{Dataset}
\vspace{-0.25em}
\noindent\textbf{Left Atrial Dataset (LA):}
This dataset~\cite{xiong2021global} consists of 100 3D gadolinium-enhanced MR imaging volumes with an isotropic resolution of $0.625 \times 0.625 \times 0.625$~mm$^3$ and manual annotations of the left atrial region. Following the pre-processing protocol from \cite{wang2023mcf}, we normalized the volumes to zero mean and unit variance. During training, we used random crops of size $112 \times 112 \times 80$. For inference, we used a sliding-window approach with a window of $112 \times 112 \times 80$ and a stride of $18 \times 18 \times 4$.

\noindent\textbf{NIH Pancreas Dataset:}
This dataset~\cite{roth2015deeporgan} consists of 82 abdominal CT volumes annotated for the pancreas. We preprocess it by applying soft-tissue windowing (HU range: $-120$ to $240$) and spatial alignment using a method from \cite{luo2021semi,wang2023mcf}. In training, we use random crops of size $96 \times 96 \times 96$. During inference, we use a sliding window with a stride of $16 \times 16 \times 16$.

\vspace{-0.5em}
\subsection{Results}
\vspace{-0.25em}
Table~\ref{table1} compares our approach with recent state-of-the-art (SOTA) methods on the LA dataset. Our method achieves the best Dice, Jaccard, and 95HD scores among the compared methods, which suggests that incorporating the uncertainty map together with deep contrastive supervision improves segmentation quality. Compared to MCF, the Dice score increases from 88.71 to 89.21 and the Jaccard index from 80.41 to 81.64, while 95HD is comparable (6.31 vs.\ 6.32) and ASD is marginally higher (1.92 vs.\ 1.90). The visual comparison in \autoref{fig:visualresults} further shows closer alignment with the ground-truth labels and fewer false segmentations for left atrial segmentation.

\begin{table}[!htb]
    \centering
    \caption{{Comparison of results using the LA dataset (average $\pm$ standard deviation).}}
    \label{table1}
    \resizebox{0.9\columnwidth}{!}{
    \begin{tabular}{c|cccc}
        \hline 
        \hline
        Method & Dice(\%)$\uparrow$ & Jaccard(\%)$\uparrow$ & 95HD(voxel)$\downarrow$ & ASD(voxel)$\downarrow$ \\
        \hline
        MT \cite{tarvainen2017mean} & 85.89 $\pm$ 0.024 & 76.58 $\pm$ 0.027 & 12.63 $\pm$ 5.741 & 3.44 $\pm$ 1.382 \\
        UA-MT \cite{yu2019uncertainty} & 85.98 $\pm$ 0.014 & 76.65 $\pm$ 0.017 & 9.86 $\pm$ 2.707 & 2.68 $\pm$ 0.776 \\
        SASSNet \cite{li2020shape} & 86.21 $\pm$ 0.023 & 77.15 $\pm$ 0.024 & 9.80 $\pm$ 1.842 & 2.68 $\pm$ 0.416 \\
        DTC \cite{luo2021semi} & 86.36 $\pm$ 0.023 & 77.25 $\pm$ 0.020 & 9.02 $\pm$ 1.015 & 2.40 $\pm$ 0.223 \\
        MC-Net \cite{wu2021semi} & 87.65 $\pm$ 0.011 & 78.63 $\pm$ 0.013 & 9.70 $\pm$ 2.361 & 3.01 $\pm$ 0.700 \\
        UCMT \cite{shen2023co}  & 88.13 $\pm$ 0.000 & 79.18 $\pm$ 0.000 & 9.14 $\pm$ 0.000 & 3.06 $\pm$ 0.000 \\
        MCF \cite{wang2023mcf}  & 88.71 $\pm$ 0.018 & 80.41 $\pm$ 0.022 & 6.32 $\pm$ 0.800 & \textbf{1.90 $\pm$ 0.187} \\
        \hline
        \rowcolor[HTML]{C8FFFD}
        \textbf{Our Method} & \textbf{89.21 $\pm$ 00.18} &  \textbf{81.64 $\pm$ 00.24} &  \textbf{6.31 $\pm$ 0.842} &  1.92 $\pm$ 0.195  \\
        \hline
        \hline
    \end{tabular}}
    \vspace{-0.75em}
\end{table}


This performance also extends to the NIH Pancreas dataset, as shown in \autoref{table2}.
As the pancreas is situated deep within the abdomen, it exhibits notable variations in size, location, and shape. In addition, pancreatic CT volumes present a more intricate background than left atrial MRI volumes, which makes pancreas segmentation more challenging than left atrial segmentation. Nevertheless, our method outperforms all compared SOTA methods on all four metrics in pancreas segmentation.
The qualitative results in \autoref{fig:visualresults} illustrate the influence of the proposed modules on segmentation quality. Our method produces sharper edges and more precise boundary separation than MCF and MC-Net, indicating more reliable object boundary predictions and a clearer distinction between the organ of interest and the background.


\begin{figure}[!thb]
\centering
\resizebox{0.78\textwidth}{!}{
    \begin{tabular}{@{} *{4}c @{}}
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/LA/EXR-GT.png} &
    \hspace{-1em} % Adjust the negative \hspace value as needed
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/LA/EXR_PRD.png} &
    \hspace{-1em} % Adjust the negative \hspace value as needed
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/LA/EXR-PRD2.png} & 
    \hspace{-1em} % Adjust the negative \hspace value as needed
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/LA/EXR-PRD1.png} \\
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/pancreas/72-gt.png} &
    \hspace{-1em} % Adjust the negative \hspace value as needed
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/pancreas/72-prd.png} &
    \hspace{-1em} % Adjust the negative \hspace value as needed
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/pancreas/72-prd2.png} &
    \hspace{-1em} % Adjust the negative \hspace value as needed
    \includegraphics[width=0.25\textwidth, trim=219.5pt 140pt 219.5pt 110pt, clip]{Figures/pancreas/72-prd1.png} \\
    {\small (a) Ground Truth} & {\small(b) MC-Net} & {\small(c) MCF} & {\small\textbf{(d) Proposed Method}} 
    \end{tabular}
    }
    \vspace{-0.5em}
    \caption{Visual comparison of segmentation results: the first and the second rows show the left atrium (LA) and pancreas, respectively.} \label{fig:visualresults}
    \vspace{-1em}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%
\vspace{-0.5em}
\begin{table}[!htb]
    \centering
    \vspace{-1.5em}
    \caption{{Comparison of results using the NIH Pancreas dataset (average $\pm$ standard deviation).}}
    \label{table2}
    \resizebox{0.9\columnwidth}{!}{
    \begin{tabular}{c|cccc}
        \hline 
        \hline
        Method & Dice(\%)$\uparrow$ & Jaccard(\%)$\uparrow$ & 95HD(voxel)$\downarrow$ & ASD(voxel)$\downarrow$ \\
        \hline
        MT \cite{tarvainen2017mean} & 74.43 $\pm$ 0.024 & 60.53 $\pm$ 0.030 & 14.93 $\pm$ 2.000 & 4.61 $\pm$ 0.929 \\
        UA-MT \cite{yu2019uncertainty} & 74.01 $\pm$ 0.029 & 60.00 $\pm$ 3.031 & 17.00 $\pm$ 3.031 & 5.19 $\pm$ 1.267 \\
        SASSNet \cite{li2020shape} & 73.57 $\pm$ 0.017 & 59.71 $\pm$ 0.020 & 13.87 $\pm$ 1.079 & 3.53 $\pm$ 1.416 \\
        DTC \cite{luo2021semi} & 73.23 $\pm$ 0.024 & 59.18 $\pm$ 0.027 & 13.20 $\pm$ 2.241 & 3.81 $\pm$ 0.953 \\
        MC-Net \cite{wu2021semi} & 73.73 $\pm$ 0.019 & 59.19 $\pm$ 0.021 & 13.65 $\pm$ 3.902 & 3.92 $\pm$ 1.055 \\
        MCF \cite{wang2023mcf} & 75.00 $\pm$ 0.026 & 61.27 $\pm$ 0.030 & 11.59 $\pm$ 1.611 & 3.27 $\pm$ 0.919 \\
        \hline
        \rowcolor[HTML]{C8FFFD}
        \textbf{Our Method} & \textbf{76.20 $\pm$ 0.022} & \textbf{62.33 $\pm$ 0.028} & \textbf{11.55 $\pm$ 2.703} & \textbf{3.10 $\pm$ 0.0980} \\
        \hline
        \hline
    \end{tabular}}
    \vspace{-1em}
\end{table}

%%%%%%%%%%%%%%%%%%%%%



\section{Conclusion}
\vspace{-0.25em}
In summary, our approach tackles challenges in semi-supervised medical image segmentation by integrating two subnetworks that identify and refine uncertain predictions. The model employs both an uncertainty attention map and an uncertainty-aware descriptor to enhance accuracy in voxel-level segmentation, particularly in scenarios with inherent prediction uncertainty. The effectiveness of our method is supported by results on both the left atrial and pancreas datasets.

% \bibliography{midl-samplebibliography}
\bibliography{midl24_212}
\let\cleardoublepage\clearpage
\appendix
\newpage
\section{Ablation Study on the Effect of Suggested Modules}
In our approach, we incorporated both deep contrastive learning and the MultiCURE Module to mitigate prediction uncertainty. To assess their impact on reducing uncertainty, we illustrate the uncertainty maps of the network predictions for two samples in \autoref{fig:ours}. Specifically, we trained the network in two settings: the first excludes the MultiCURE Module and deep contrastive learning, while the second includes both to address uncertainty.

\autoref{fig:ours} indicates that, at inference, the network tends to produce lower confidence scores when the uncertainty-aware modules are not used. In contrast, including the MultiCURE Module and deep contrastive learning substantially increases the model's prediction confidence. This observation suggests that these modules make the network more confident on uncertain voxels, thereby improving overall prediction performance. Additionally, \autoref{table:3} reports the individual effect of each module on overall performance. Removing the MultiCURE Module results in a notable decline in model performance, whereas omitting deep contrastive learning leads to only a slight drop.

\begin{figure}[!thb]
    \centering
    \resizebox{\textwidth}{!}{
    \begin{tabular}{@{} *{5}c @{}}
    \includegraphics[width=0.22\textwidth]{Figures/suple/gt_s1.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap2_s1.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap_s1.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap2_s1_u.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap_s1_u.png} \\
    \includegraphics[width=0.22\textwidth]{Figures/suple/gt_s2.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap2_s2.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap_s2.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap2_s2_u.png} &
    \includegraphics[width=0.22\textwidth]{Figures/suple/heatmap_s2_u.png} \\
    { \small Ground Truth} & {\small Prediction (ours)} & {\small Heatmap (ours)} & {\small Prediction (baseline)}  & {\small Heatmap (baseline)}
    \end{tabular}
}
    \caption{Visualization of prediction maps and activation maps for two samples from the LA dataset, comparing our method with the suggested modules against the baseline model without the proposed enhancements.} 
    \label{fig:ours}
\end{figure}


\begin{table*}[h!]
\caption{Effect of each suggested module on the overall performance on the pancreas dataset (average).}
\label{table:3}
\resizebox{\textwidth}{!}{
\begin{tabular}{cccccc}
\hline
$\mathcal{L}_{\text{contrastive}}$ & MultiCURE & Dice(\%)$\uparrow$ & Jaccard(\%)$\uparrow$ & 95HD(voxel)$\downarrow$ & ASD(voxel)$\downarrow$ \\
\hline 
 $\times$ & $\times$ & 74.10 & 61.00  & 12.30 & 4.12 \\
$\sqrt{ }$ & $\times$ & 75.00 & 61.22 & 11.95 & 3.27 \\
$\times$ & $\sqrt{ }$ & 75.80 & 62.33 & 11.58 & 3.25 \\
\hline
\rowcolor[HTML]{C8FFFD}
$\sqrt{ }$ & $\sqrt{ }$ & 76.20 & 62.33 & 11.55 & 3.10 \\
\end{tabular}
}
\end{table*}


The architecture of the MultiCURE Module allows the network to adapt its focus to the characteristics of the input, providing a mechanism to selectively reduce uncertainty in the regions where it matters most. This adaptability contributes to the observed improvement in the model's confidence, especially on uncertain voxels. The results suggest that the MultiCURE Module enhances the network's ability to handle contextual uncertainties, leading to more confident and accurate predictions.
The contrastive loss, in turn, guides the learning of feature representations. The results in \autoref{table:3} show that removing contrastive learning lowers the model's performance, which underlines its role in enhancing feature representations. The contrastive learning mechanism encourages the model to focus on relevant information, thereby improving its ability to make accurate predictions. The MultiCURE Module and contrastive learning jointly contribute to the performance of the proposed architecture: while the MultiCURE Module addresses contextual uncertainties, contrastive learning complements it by refining feature representations. The experimental results support the efficacy of this combined approach in achieving more robust and confident predictions, particularly for uncertain or ambiguous image data.
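
To make the ablation concrete, the Dice differences can be read directly from \autoref{table:3}. The notation $\Delta$ is introduced here only for illustration; the values are plain differences of the reported averages (in percentage points) and do not stem from an additional experiment:
\begin{align*}
\Delta_{\text{w/o MultiCURE}} &= 76.20 - 75.00 = 1.20, \\
\Delta_{\text{w/o contrastive}} &= 76.20 - 75.80 = 0.40, \\
\Delta_{\text{w/o both}} &= 76.20 - 74.10 = 2.10.
\end{align*}
Since no standard deviations are reported for the ablation, these differences should be read as indicative rather than as statistically tested effects.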


\section{Limitations}
In our strategy, we proposed several mechanisms to increase the certainty of pseudo-labels, thereby making better use of unlabeled data. Although our method recalibrates features to improve voxel representations and thus yield more confident predictions, uncertainty along object boundaries is often inherent to the data: imperfections of the imaging device make it difficult to distinguish object boundaries from overlapping background. Consequently, our method remains limited in enhancing prediction certainty along boundary regions, which are influenced by the characteristics of the imaging device. As illustrated in \autoref{fig:ours}, the predictions along the LA boundaries lack high confidence. Notably, expert radiologists face the same difficulty in precisely distinguishing boundary voxels, especially in 3D volumes, where multiple raters are often employed to mitigate boundary errors by prioritizing the most agreed-upon regions \cite{ji2021learning}.


Additionally, while our strategy enlarges the network's receptive field through the MultiCURE module, it has limitations in scenarios involving small lesions, such as microbleeds in brain MRI. Although the extended receptive field helps capture long-range dependencies, our approach may not provide the strong texture representation required for precise detection of such lesions, which depends on both fine-grained and coarse features. This limitation highlights the difficulty of capturing both fine-grained and coarse features within a single model framework. Future research could explore specialized architectures or fusion techniques aimed at improving the detection sensitivity for these types of lesions.


\end{document}



