% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encodings may result in incorrect characters.
%
\usepackage{graphicx}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
\usepackage{subcaption}
% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}
\usepackage[pagebackref=true,breaklinks=true,colorlinks,bookmarks=false]{hyperref}
%
\begin{document}
%
\title{A Simple Mean-Teacher UNet Model for Efficient Abdominal Organ Segmentation}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Zixiao Zhao\orcidID{0000-0003-2808-4659} \and
Jiahua Chu\orcidID{0000-0001-5031-3046}}
%
\authorrunning{Z. Zhao et al.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{AI Innovation and Commercialisation Centre, NUSRI, Suzhou, China \\
\email{\{zixiao.zhao, jiahua.chu\}@nusri.cn}
}
%
\maketitle              % typeset the header of the contribution
%
\begin{abstract}
One major barrier to deep learning-based medical image segmentation is that such tasks demand high accuracy, so models must be trained on large datasets annotated by experts, a process that is exceptionally time-consuming and laborious.
For abdominal organ segmentation, this problem becomes more prominent as the image size grows.
To address this problem, we train a classical UNet model with the Mean-Teacher strategy and obtain relatively satisfactory segmentation results (58.93\% DSC and 59.54\% NSD) on a semi-supervised abdominal segmentation dataset.
The core idea is to use labeled data to improve the segmentation performance of the model itself, while introducing noise on unlabeled data to improve the generalization of the model.
Inspired by nnUNet, we keep the model structure as simple as possible, thus ensuring efficiency during the training and inference phases ($<$2\,GB VRAM consumption and $\sim$10\,s inference time).

\keywords{Medical Image Segmentation \and Abdominal Organ Segmentation \and Semi-supervised \and UNet \and Mean-Teacher}
\end{abstract}


%####################################################################
\section{Introduction}

In recent years, approaches based on Convolutional Neural Networks (CNNs) and Transformers have achieved state-of-the-art results in medical image segmentation, e.g.,~\cite{zhou2018unet++,chen2021transunet}.
However, as such methods develop, model structures become increasingly complex, the number of parameters grows dramatically, and the amount of annotated data required to train them grows accordingly~\cite{ssl4mis2020}.
For medical image segmentation tasks, annotating a dataset requires expert labeling at the pixel or voxel level, a process that is often extremely time-consuming and laborious~\cite{yu2019uncertainty}.
For abdominal organ segmentation, this problem is more serious because the organs and diseases found in this region are more complex, and the images are larger and of higher resolution~\cite{AbdomenCT-1K}.

In this context, semi-supervised segmentation methods are more practical because they require only a small amount of fine annotation and rely on a larger amount of unlabeled data instead.
In the last three years, a large number of semi-supervised segmentation methods have achieved satisfactory results in their respective domains.
One of the most widely used methods is the Mean-Teacher model~\cite{tarvainen2017mean} and its many variants~\cite{reiss2021every,yu2019uncertainty,you2022simcvd}. Other commonly used strategies include pseudo labeling~\cite{wang2021semi}, adversarial learning~\cite{hu2020coarse}, and contrastive learning~\cite{peng2021self}.

Despite advances on semi-supervised learning benchmarks, previous methods still face several major challenges:
\textbf{Domain variation:} Most of these methods are designed for 2D natural images and incur additional cost when transferred to medical images.
\textbf{Generalization:} Given the limited amount of training data, deep models usually suffer from over-fitting and co-adaptation~\cite{you2022simcvd}.

In this work, we propose a simple and effective semi-supervised scheme that is also based on the Mean Teacher~\cite{tarvainen2017mean} idea.
The framework takes both labeled and unlabeled images as input and contaminates a copy of each with random noise.
The uncontaminated images are passed through a Student model, an ordinary UNet~\cite{ronneberger2015u}, while the contaminated images are passed through a Teacher model with exactly the same structure.
For labeled data, the Student model is supervised both by the ground truth and by a consistency constraint against the Teacher's predictions on the contaminated data; for unlabeled data, only the consistency loss is used for supervision.
The parameters of the Teacher model are then periodically updated from those of the Student model by exponential moving average (EMA).

The main contributions of this work are two-fold: 1) Inspired by nnUNet~\cite{isensee2021nnu}, our approach uses only the classical UNet model for segmentation, making the training and prediction process cheap ($<$5\,GB RAM and $<$2\,GB VRAM) and efficient (6\,s/image).
2) Also inspired by nnUNet~\cite{isensee2021nnu}, we use appropriate preprocessing (and multiple augmentation methods during the training phase), which enables our model to achieve stable results even on data with inconsistent distributions.


%####################################################################
\section{Method}


%###########################
\subsection{Preprocessing}
Thanks to the rich transformation API provided by the MONAI framework~\cite{MONAI_Consortium_MONAI_Medical_Open_2020}, we apply several preprocessing methods that increase the reusability of the model.

\textbf{General preprocessing}: General preprocessing refers to the transforms that are applied in the training, validation and prediction phases.
\begin{itemize} 
 \item Orientation matching: Based on the orientation of the training data, all input images are uniformly adjusted to the ``LPI'' orientation.
 \item Resampling method for anisotropic data: After orientation matching, we resample the image to a spacing of (4, 4, 10) to reduce the size of the input data.
 \item Intensity normalization method: We keep only the intensity range $[-1000, 500]$ and then rescale it to the value range $[0.0, 1.0]$.
\end{itemize}
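
As a minimal sketch of the intensity normalization step, and assuming that voxels outside the interval are clipped to its bounds before rescaling (the symbols $v$ and $\hat{v}$ are introduced here only for illustration), the mapping from an input intensity $v$ to its normalized value $\hat{v}$ can be written as
\begin{equation}
\hat{v} = \frac{\min\bigl(\max(v, -1000),\, 500\bigr) + 1000}{1500} .
\end{equation}
Under this assumption, an intensity of $v = -250$ maps to $\hat{v} = 750/1500 = 0.5$, while any $v \leq -1000$ maps to $0$ and any $v \geq 500$ maps to $1$.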



%###########################
\subsection{Proposed Method}
In general semi-supervised learning, the training set consists of two parts:
the labeled dataset $D_l$ with $N$ annotated images and the unlabeled dataset $D_u$ with $M$ raw images ($M \gg N$).
The whole training set is $D_{N+M} = D_l \cup D_u$.
For an image $x_i \in D_l$, its ground truth is available.
Conversely, if $x_j \in D_u$, its ground truth is not provided~\cite{luo2021semi}.
Our Mean Teacher UNet model is shown in Figure~\ref{fig:Network}.
Both $D_l$ and $D_u$ are used to compute the consistency losses, corresponding to $L_{c1}$ and $L_{c2}$ in the figure.
$D_l$ is additionally used to compute the usual supervised segmentation loss $L_s$ to update the model parameters.

\begin{figure}[htbp]
\centering
\includegraphics[scale=0.5]{imgs/meanteacher.png}
\caption{Network architecture: the Student and Teacher models are both randomly initialized and receive uncontaminated and contaminated data, respectively. The Teacher model's parameters are gradually updated from the Student model by EMA.}
\label{fig:Network}
\end{figure}

We follow the same strategy as Mean Teacher.
The overall architecture of the network consists of two parts, the Student model $M_S$ and the Teacher model $M_T$.
In our design, both are identically initialized UNet models.

\begin{equation}\label{eq:teacher}
    \theta_{T}^{'} = \alpha\theta_{T} + (1-\alpha)\theta_{S}
\end{equation}

The parameters of $M_T$ are obtained as an exponential moving average of the parameters of $M_S$, as given in Eq.~(\ref{eq:teacher}).
At the beginning of training, since the model is randomly initialized, the parameters of $M_S$ are far from correct.
$M_T$ should track what $M_S$ learns, so $\alpha$ starts from zero.
As training proceeds and $M_S$ reaches a certain accuracy, the ensemble becomes useful, so $\alpha$ is increased until it reaches 0.99.
The parameters of $M_S$ are updated by gradient descent on the loss function.
The loss function has two parts. The first is the supervised Dice loss, which gives the model its basic segmentation ability. The second is the unsupervised consistency loss, for which we use the MSE loss; it encourages the prediction of $M_S$ on the uncontaminated data to be as close as possible to that of $M_T$ on the contaminated data (the contamination applied here is additive white Gaussian noise).
Because the parameters of $M_T$ are a moving average of those of $M_S$, its predictions should not jitter much under small fluctuations.
If the model is correct, the labels predicted by the Student and Teacher models should be close.
Tuning the model in the direction that brings the two predictions closer is thus equivalent to moving it towards predicting the correct labels.
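
The training objective described above can be summarized as follows. This is a sketch under stated assumptions: the symbols $p_S$, $p_T$, $\eta$, $\Omega$ and the weight $\lambda$ are introduced here only for illustration, we take $L_{c1}$ to be the consistency term on labeled data and $L_{c2}$ the one on unlabeled data, and the exact weighting between the terms is not specified in the text. Let $p_S(x)$ denote the prediction of $M_S$ on an uncontaminated input $x$, and $p_T(x+\eta)$ the prediction of $M_T$ on the same input contaminated with Gaussian noise $\eta$. With $\Omega$ the set of output elements of a patch, the MSE consistency loss for one input is
\begin{equation}
L_{c}(x) = \frac{1}{|\Omega|}\sum_{k \in \Omega}\bigl(p_S(x)_k - p_T(x+\eta)_k\bigr)^2 ,
\end{equation}
and the total loss used to update $M_S$ may be written as
\begin{equation}
L = L_s + \lambda\,(L_{c1} + L_{c2}) ,
\end{equation}
where $L_s$ is the Dice loss on labeled data and $L_{c1}$, $L_{c2}$ are averages of $L_c$ over the labeled and unlabeled inputs, respectively. For the EMA coefficient in Eq.~(\ref{eq:teacher}), one common schedule that starts from zero and saturates at 0.99, used in the reference Mean Teacher implementation, is
\begin{equation}
\alpha_t = \min\left(1 - \frac{1}{t+1},\; 0.99\right) ,
\end{equation}
with $t$ the training step; whether exactly this schedule is used here is an assumption. At $t=0$ it gives $\alpha_0 = 0$, so that $M_T$ simply copies $M_S$, and it reaches the cap of 0.99 at $t = 99$.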


\subsection{Post-processing}
Due to the nature of the dataset, we did not use specific post-processing methods.
%####################################################################

\section{Experiments}
\subsection{Dataset and evaluation measures}
The FLARE2022 dataset is curated from more than 20 medical groups under license permission, including MSD~\cite{simpson2019MSD}, KiTS~\cite{KiTS,KiTSDataset}, AbdomenCT-1K~\cite{AbdomenCT-1K}, and TCIA~\cite{clark2013TCIA}. The training set includes 50 labelled CT scans with pancreas disease and 2000 unlabelled CT scans with liver, kidney, spleen, or pancreas diseases. The validation set includes 50 CT scans with liver, kidney, spleen, or pancreas diseases.
The testing set includes 200 CT scans, of which 100 cases have liver, kidney, spleen, or pancreas diseases and the other 100 cases have uterine corpus endometrial, urothelial bladder, stomach, sarcomas, or ovarian diseases. All the CT scans only have image information and the center information is not available.

The evaluation measures consist of two accuracy measures: Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD), and three running efficiency measures: running time, area under GPU memory-time curve, and area under CPU utilization-time curve. All measures will be used to compute the ranking. Moreover, the GPU memory consumption has a 2 GB tolerance.
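
For reference, the two accuracy measures can be stated as follows; the notation ($P$, $G$, $S_P$, $S_G$, $B^{\tau}$) is introduced here only for illustration. For a predicted mask $P$ and a ground-truth mask $G$ of one organ, the Dice Similarity Coefficient is
\begin{equation}
\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} .
\end{equation}
The Normalized Surface Dice compares the surfaces $S_P$ and $S_G$ of the two masks rather than their volumes: it is the fraction of both surfaces that lies within a tolerance $\tau$ of the other surface,
\begin{equation}
\mathrm{NSD}(P, G) = \frac{|S_P \cap B_G^{\tau}| + |S_G \cap B_P^{\tau}|}{|S_P| + |S_G|} ,
\end{equation}
where $B_G^{\tau}$ and $B_P^{\tau}$ denote the border regions within distance $\tau$ of $S_G$ and $S_P$. The value of $\tau$ is set by the challenge organizers and is not restated here. Both measures lie in $[0, 1]$ and are reported as percentages in the tables of this paper.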


\subsection{Implementation details}
\subsubsection{Environment settings}
The development environments and requirements are presented in Table~\ref{table:env}.


\begin{table}[!htbp]
\caption{Development environments and requirements.}\label{table:env}
\centering
\begin{tabular}{ll}
\hline
Windows/Ubuntu version       & Ubuntu 18.04.4 LTS\\
\hline
CPU   & Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz \\
\hline
RAM                         &12$\times $32GB; 2.67MT$/$s\\
\hline
GPU (number and type)                         & 8$\times$ NVIDIA GeForce RTX 2080Ti\\
\hline
CUDA version                  & 11.1\\                          \hline
Programming language                 & Python 3.6.10\\ 
\hline
Deep learning framework & PyTorch (Torch 1.7.0, torchvision 0.8.0) \\
\hline
Specific dependencies         &     monai 0.8.0                   \\                                                                      
\hline
(Optional) Link to code     &        \url{https://github.com/SeanCho1996/MeanTeacher3dUNet}                                  \\
\hline
\end{tabular}
\end{table}


\subsubsection{Training protocols}
The training parameters are shown in Table~\ref{table:training}.

In the training phase, we apply a series of augmentations to the input data to improve the robustness of the model.
\begin{itemize} 
 \item Random affine: In this stage we add random rotation and scale transformations.
 \item Cropping strategy: The cropping strategy differs for labeled and unlabeled training data. For labeled data, patches are randomly cropped around foreground voxels according to the label values; for unlabeled data, completely random cropping is used.
 The patch size is fixed to (128, 128, 16).
 \item Other augmentation methods: random Gaussian noise as well as random flips along the three axes.
\end{itemize}


\begin{table*}[!htbp]
\caption{Training protocols.}
\label{table:training}
\begin{center}
% \resizebox{0.47\textwidth}{!}{
\begin{tabular}{ll} 
\hline
Network initialization         & ``He'' normal initialization\\
\hline
Batch size                    & 8 $\times$ 3 samples per image \\
\hline 
Patch size & 128$\times$128$\times$16  \\ 
\hline
Total epochs & 1000 \\
\hline
Optimizer          & Adam         \\ \hline
Initial learning rate (lr)  & 1e-4 \\ \hline
Lr decay schedule & / \\
\hline
Training time                                           & 15 hours \\  \hline 
Number of model parameters    & 3.5M\footnote{https://github.com/sksq96/pytorch-summary} \\ \hline
Number of flops & 30.27G\footnote{https://github.com/facebookresearch/fvcore} \\ \hline
CO$_2$eq & 1 kg\footnote{https://github.com/lfwa/carbontracker/} \\  \hline
\end{tabular}
%}
\end{center}
\end{table*}




\section{Results and discussion}
% Note: Please describe at least the following aspects:\\
% The effect of using unlabelled cases;\\
% What kind of cases the proposed method works well?\\
% What are the possible reasons for the failed cases or organs?\\
% Segmentation efficiency analysis\\


\subsection{Quantitative results on validation set}
% Currently, you can report the Dice score on validation set
The overall quantitative results are shown in Table~\ref{tab:quanti-validation}.

Table~\ref{tab:dsc-comparison} compares the results obtained with and without the unlabeled data.
The semi-supervised model outperforms the fully supervised model trained only on labeled data on all classes except Pancreas and Duodenum, where the fully supervised model holds a slight advantage of at most $\sim$0.6 percentage points.
The generalization of the model is clearly enhanced by the use of unlabeled data, coupled with a wide variety of data augmentations.


\begin{table}[!htbp]
\caption{Quantitative results on validation set.}
\setlength{\tabcolsep}{10mm}
\label{tab:quanti-validation}
\centering
\begin{tabular}{ccc}
\hline
Organ               & DSC(\%)            & NSD (\%)        \\
\hline
Liver               & 81.56$\pm$17.07     & 72.48$\pm$19.02  \\
Right Kidney        & 69.03$\pm$24.96     & 61.02$\pm$24.81  \\
Spleen              & 76.03$\pm$19.49     & 67.03$\pm$20.68  \\
Pancreas            & 54.87$\pm$14.79     & 65.90$\pm$14.47  \\
Aorta               & 79.94$\pm$12.27     & 76.32$\pm$14.21  \\
Inferior Vena Cava  & 68.10$\pm$14.09     & 58.75$\pm$14.45  \\
Right Adrenal Gland & 38.55$\pm$17.90     & 51.25$\pm$19.69  \\
Left Adrenal Gland  & 35.97$\pm$20.06     & 47.41$\pm$23.77  \\
Gallbladder         & 32.81$\pm$27.87     & 24.31$\pm$21.31  \\
Esophagus           & 54.05$\pm$15.88     & 65.11$\pm$16.99  \\
Stomach             & 57.32$\pm$19.91     & 53.76$\pm$19.39  \\
Duodenum            & 46.20$\pm$15.99     & 66.54$\pm$17.25  \\
Left Kidney         & 71.78$\pm$22.57     & 64.17$\pm$24.43  \\
\hline
Mean                & 58.93$\pm$18.68     & 59.54$\pm$19.27  \\
\hline
\end{tabular}
\end{table}

\begin{table}[!htbp]
\caption{DSC(\%) comparison on validation set.}
\setlength{\tabcolsep}{5mm}
\label{tab:dsc-comparison}
\centering
\begin{tabular}{ccc}
\hline
Organ    & with unlabeled data       & without unlabeled data        \\
\hline
Liver               & \textbf{85.58}  & 80.81  \\
Right Kidney        & \textbf{71.69}  & 67.66  \\
Spleen              & \textbf{76.07}  & 72.87  \\
Pancreas            & 53.93  & \textbf{54.30}  \\
Aorta               & \textbf{79.62}  & 77.67  \\
Inferior Vena Cava  & \textbf{68.40}  & 66.84  \\
Right Adrenal Gland & \textbf{38.06}  & 37.05  \\
Left Adrenal Gland  & \textbf{37.91}  & 33.03  \\
Gallbladder         & \textbf{34.22}  & 29.86  \\
Esophagus           & \textbf{57.83}  & 53.67  \\
Stomach             & \textbf{61.89}  & 50.18  \\
Duodenum            & 45.41  & \textbf{46.04}  \\
Left Kidney         & \textbf{72.22}  & 63.82  \\
\hline
Mean       & \textbf{60.22}  & 56.44  \\
\hline
\end{tabular}
\end{table}


% Please do ablation study to analysis the effect of unlabelled data.


\subsection{Qualitative results on validation set}
\label{sec:qual}
% This part is optional during validation phase since you do not have validation ground truth.
At the image level, we find that our model performs well on validation images whose dimensions are similar to those of the labeled data, as shown in Figures~\ref{fig:figures-28} and \ref{fig:figures-30}.
The dimensions of these two images are (512, 512, 96) and (512, 512, 89), respectively, while the average size of the labeled data is approximately (512, 512, 100).
Conversely, our model performs relatively poorly on images whose dimensions differ markedly from those of the labeled data, as shown in Figures~\ref{fig:figures-18} and \ref{fig:figures-02}.
The dimensions of these two images are (512, 512, 203) and (512, 512, 171), respectively, so their extent along the third (slice) axis is almost twice that of the labeled data.
The reason is that, in order to reduce the resource consumption of the model, we set a relatively large spacing in preprocessing; during downsampling, too much information is lost from these large-scale images, so their features are not easily extracted.

At the organ level, our model performs well on targets with fixed shapes and large volumes, such as the right and left kidneys, the liver, and the spleen.
It also performs well on targets with fixed positions, such as the aorta and the inferior vena cava.
By inspecting the images, we found that our model does not perform well on smaller targets, especially the (left and right) adrenal glands and the gallbladder.
This is explainable: because we set a large spacing, the feature representation of small-scale targets is inevitably weakened.

\begin{figure}[htb!]
	\centering
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case30/gt/coronal.png}
        \caption{Coronal Plane-GT}
        \label{fig:30-cgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case30/gt/sagittal.png}
        \caption{Sagittal Plane-GT}
        \label{fig:30-sgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case30/gt/axial.png}
        \caption{Axial Plane-GT}
        \label{fig:30-tgt}
    \end{subfigure}
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case30/val/coronal.png}
        \caption{Coronal Plane-Pred}
        \label{fig:30-cp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case30/val/sagittal.png}
        \caption{Sagittal Plane-Pred}
        \label{fig:30-sp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case30/val/axial.png}
        \caption{Axial Plane-Pred}
        \label{fig:30-tp}
    \end{subfigure}
    \caption{Standard Validation Case 00030}
    \label{fig:figures-30}
\end{figure}


\begin{figure}[htb!]
	\centering
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case28/gt/coronal.png}
        \caption{Coronal Plane-GT}
        \label{fig:28-cgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case28/gt/sagittal.png}
        \caption{Sagittal Plane-GT}
        \label{fig:28-sgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case28/gt/axial.png}
        \caption{Axial Plane-GT}
        \label{fig:28-tgt}
    \end{subfigure}
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case28/val/coronal.png}
        \caption{Coronal Plane-Pred}
        \label{fig:28-cp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case28/val/sagittal.png}
        \caption{Sagittal Plane-Pred}
        \label{fig:28-sp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/good/case28/val/axial.png}
        \caption{Axial Plane-Pred}
        \label{fig:28-tp}
    \end{subfigure}
    \caption{Standard Validation Case 00028}
    \label{fig:figures-28}
\end{figure}

\begin{figure}[htb!]
	\centering
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case18/gt/coronal.png}
        \caption{Coronal Plane-GT}
        \label{fig:18-cgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case18/gt/sagittal.png}
        \caption{Sagittal Plane-GT}
        \label{fig:18-sgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case18/gt/axial.png}
        \caption{Axial Plane-GT}
        \label{fig:18-tgt}
    \end{subfigure}
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case18/val/coronal.png}
        \caption{Coronal Plane-Pred}
        \label{fig:18-cp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case18/val/sagittal.png}
        \caption{Sagittal Plane-Pred}
        \label{fig:18-sp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case18/val/axial.png}
        \caption{Axial Plane-Pred}
        \label{fig:18-tp}
    \end{subfigure}
    \caption{Bias Validation Case 00018}
    \label{fig:figures-18}
\end{figure}

\begin{figure}[htb!]
	\centering
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case02/gt/coronal.png}
        \caption{Coronal Plane-GT}
        \label{fig:02-cgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case02/gt/sagittal.png}
        \caption{Sagittal Plane-GT}
        \label{fig:02-sgt}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case02/gt/axial.png}
        \caption{Axial Plane-GT}
        \label{fig:02-tgt}
    \end{subfigure}
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case02/val/coronal.png}
        \caption{Coronal Plane-Pred}
        \label{fig:02-cp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case02/val/sagittal.png}
        \caption{Sagittal Plane-Pred}
        \label{fig:02-sp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\textwidth}
        \includegraphics[width=\textwidth]{imgs/bad/case02/val/axial.png}
        \caption{Axial Plane-Pred}
        \label{fig:02-tp}
    \end{subfigure}
    \caption{Bias Validation Case 00002}
    \label{fig:figures-02}
\end{figure}


\subsection{Quantitative results on test set}
% Currently, you can report the Dice score on validation set
The overall quantitative results on the test set are shown in Table~\ref{tab:quanti-test}.


\begin{table}[!htbp]
\caption{Quantitative results on test set.}
\setlength{\tabcolsep}{10mm}
\label{tab:quanti-test}
\centering
\begin{tabular}{ccc}
\hline
Organ               & DSC(\%)            & NSD (\%)        \\
\hline
Liver               & 80.12$\pm$10.56     & 68.61$\pm$14.39  \\
Right Kidney        & 64.59$\pm$24.25     & 55.27$\pm$24.22  \\
Spleen              & 73.44$\pm$22.86     & 65.14$\pm$22.52  \\
Pancreas            & 50.61$\pm$16.43     & 63.11$\pm$17.45  \\
Aorta               & 78.00$\pm$14.41     & 74.46$\pm$16.35  \\
Inferior Vena Cava  & 67.08$\pm$15.68     & 59.61$\pm$16.41  \\
Right Adrenal Gland & 41.91$\pm$15.82     & 56.39$\pm$19.22  \\
Left Adrenal Gland  & 38.44$\pm$19.45     & 51.00$\pm$23.74  \\
Gallbladder         & 35.14$\pm$26.80     & 25.82$\pm$19.69  \\
Esophagus           & 52.70$\pm$15.39     & 64.44$\pm$16.50  \\
Stomach             & 52.73$\pm$19.33     & 48.11$\pm$18.38  \\
Duodenum            & 41.87$\pm$15.75     & 62.04$\pm$16.08  \\
Left Kidney         & 68.61$\pm$14.39     & 60.38$\pm$23.44  \\
\hline
Mean                & 58.14$\pm$18.47     & 58.03$\pm$19.11  \\
\hline
\end{tabular}
\end{table}


\subsection{Segmentation efficiency results}
\label{sec:effi}
Regarding segmentation efficiency, our model predicted the 50 validation images in about 5 minutes, which we consider an acceptable time.
For the majority of validation images, the time used to predict an individual result was within 11 seconds (the mean inference time of our method on the validation set is 11.56 seconds), but for large-scale images, our method took up to 45.45 seconds.
Although we increased the spacing of the input data to make the image array smaller, we had to sacrifice the patch size to reduce the GPU memory usage (with a mean of 2036.04 MB and a maximum of 2067 MB), which results in a larger number of patches, so our final prediction time is similar to that of nnUNet.


\subsection{Limitations and future work}
As mentioned in Section~\ref{sec:qual} and Section~\ref{sec:effi}, our model had to compromise on the spacing after resampling and on the size of the patches entering the neural network in order to improve the computational speed and reduce the computational consumption. As a result, its ability to handle small-scale targets is poor.

To solve this problem, our subsequent work has two general directions. One is to reduce the spacing appropriately and find parameter settings that balance computational consumption and accuracy (we have tried smaller spacings, which we expect to improve the segmentation accuracy noticeably). The other is to use a cascade model, following nnUNet's practice, that adds an additional neural network for small-size targets.

In addition to optimizing the network structure, we can also run more experiments on data augmentation methods.
At this stage, we have only used conventional and simple data augmentation methods.
Due to time constraints, we did not implement more complex augmentation methods such as CutOut or CutMix.





\section{Conclusion}
In conclusion, this work uses the classical UNet model and the Mean Teacher strategy to implement a semi-supervised abdominal organ segmentation task.
We do not use complex model structures or difficult-to-deploy ways of exploiting unlabeled data, because we adhere to the idea that for medical images, which usually have relatively fixed structures, good results should be obtainable even with simple designs.
This idea is also in line with the core idea of the nnUNet model~\cite{isensee2021nnu}, which has been widely used in recent years.
In addition, we sacrifice some accuracy on small-target segmentation to obtain a smaller model size and lower computational resource usage.


\subsubsection{Acknowledgements} The authors of this paper declare that the segmentation method they implemented for participation in the FLARE 2022 challenge has not used any pre-trained models nor additional datasets other than those provided by the organizers.


%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{ref}

\end{document}
