% The validation performance (DSC) comparisons between with and without using unlabelled
% images
% Visualized examples of successful and failed cases
% Segmentation efficiency analysis
% Limitations and future work


\subsubsection{Quantitative results}

We present both quantitative and qualitative results of our proposed method, together with an ablation study (Table \ref{table:ablation_study}) that analyzes the effectiveness of each module.

\begin{table}[!h]
\centering
\caption{Comparison of DSC with and without pseudo-labeled samples used as supervised training data. Results are reported for TransUnet on the public test set. The highest value in each row is shown in bold.}
\footnotesize
\begin{tabular}{ | c | c | c | c | }
\hline
\textbf{Number of pseudo-labeled samples} & 0 & 200 & 700  \\
\hline
\textbf{Liver} & 0.9215  & 0.9555 & \textbf{0.9604} \\
\textbf{Right Kidney (RK)} & 0.6548 & 0.7944 & \textbf{0.8014} \\
\textbf{Spleen} & 0.8159 & 0.9144 & \textbf{0.9255}	\\
\textbf{Pancreas} & 0.6235 & 0.7309 & \textbf{0.7567}	\\
\textbf{Aorta} & 0.8794 & 0.9272 & \textbf{0.9335} \\
\textbf{Inferior Vena Cava (IVC)} & 0.7145 & 0.7967 & \textbf{0.8207} 	\\
\textbf{Right Adrenal Gland (RAG)} & 0.4688 & 0.6507 & \textbf{0.6545} 	\\
\textbf{Left Adrenal Gland (LAG)} & 0.4209 & \textbf{0.6179} & 0.6138 	\\
\textbf{Gallbladder} & 0.4798 & \textbf{0.5889} & 0.5885	\\
\textbf{Esophagus} & 0.7086 & \textbf{0.7784} & 0.7783 	\\
\textbf{Stomach} & 0.7446 & 0.8403 & \textbf{0.8424} 	\\
\textbf{Duodenum} &	0.4387 & 0.5617 & \textbf{0.5679}	\\
\textbf{Left Kidney (LK)} & 0.6763 & \textbf{0.8112} & 0.8026 \\ 
\textbf{Mean DSC} & 0.6575 & 0.7668 & \textbf{0.7728} \\ 

\hline
\end{tabular}
\label{table:unlabeled}
\vspace{-2mm}
\end{table}

Table \ref{table:unlabeled} offers several insights. Overall, training with pseudo-labeled data improves the DSC of every organ by a large margin. We have not yet exploited every unlabeled sample (only 700 samples were used for training in our submission), but intuitively, using more unlabeled samples should further improve the evaluation results. Another notable observation is that the DSC of some small organs (gallbladder and adrenal glands) remains low and barely changes between 200 and 700 samples, which we attribute to the class imbalance problem (see Section \ref{sec:limitation}).
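
To make the size of these gains concrete, the mean DSC differences can be read directly from the last row of Table \ref{table:unlabeled}:
\[
\Delta_{0 \rightarrow 200} = 0.7668 - 0.6575 = 0.1093, \qquad
\Delta_{200 \rightarrow 700} = 0.7728 - 0.7668 = 0.0060 .
\]
Most of the improvement therefore comes from the first 200 pseudo-labeled samples. With only three settings measured, how the score scales beyond 700 samples remains an open question.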

\begin{table}[!h]
\centering
\caption{Ablation experiment on each proposed module and technique.}
\footnotesize
\begin{tabular}{ | c | c | c | c | c | c | }
\hline
\textbf{No. } & \textbf{\thead{Positional \\ Encoding}} & \textbf{CPS}  &  \textbf{\thead{Uncertainty \\ Estimation}} & \textbf{\thead{Mask \\ Propagation}} & \textbf{\thead{Mean \\ DSC}} \\ 
\hline
1  &              & &  & & 0.6419   \\
2  & $\checkmark$ & &  & & 0.6575 \\
3  & $\checkmark$ & $\checkmark$ &              & & 0.762   \\
4  & $\checkmark$ & $\checkmark$ & $\checkmark$ & & 0.7728   \\
5  & $\checkmark$ & $\checkmark$ & $\checkmark$ & $\checkmark$ & \textbf{0.784}   \\
\hline
\end{tabular}
\label{table:ablation_study}
\vspace{-2mm}
\end{table}

Table \ref{table:ablation_study} shows that each module contributes to the final score of our submission. The baseline model reported in the first row is TransUnet. Cross Pseudo Supervision (CPS) uses both DeeplabV3+ and TransUnet as the training models. In the fourth and fifth rows, where both CPS and Uncertainty Estimation (UE) are used, pseudo-labels that pass the UE check are used as supervised inputs in the CPS workflow, whereas the remaining unlabeled data are used as unsupervised inputs. In the fifth row, applying Mask Propagation (MP) raises the mean DSC further, from 0.7728 to 0.784. Notably, MP relies on only a minority of the slices to propagate masks through the whole volume. The detailed evaluation of our best submission is shown in Table \ref{table:submission}.
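
The increment attributable to each component can be worked out from consecutive rows of Table \ref{table:ablation_study}, under the assumption that adjacent rows differ only in the newly added component (PE denotes Positional Encoding):
\[
\begin{array}{lcl}
\Delta_{\mathrm{PE}}  & = & 0.6575 - 0.6419 = 0.0156, \\
\Delta_{\mathrm{CPS}} & = & 0.7620 - 0.6575 = 0.1045, \\
\Delta_{\mathrm{UE}}  & = & 0.7728 - 0.7620 = 0.0108, \\
\Delta_{\mathrm{MP}}  & = & 0.7840 - 0.7728 = 0.0112 .
\end{array}
\]
CPS accounts for the largest share of the total gain of $0.1421$, while PE, UE, and MP each add between $0.01$ and $0.02$.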

% \begin{table}[!h]
% \centering
% \caption{}
% \footnotesize
% \begin{tabular}{ | c | c | c | c | c | c | c |}
% \hline
% \textbf{Mean DSC} & \textbf{Liver} & \textbf{RK} & \textbf{Spleen} &	\textbf{Pancreas} &	\textbf{Aorta} &	\textbf{IVC} \\
% \hline
% 0.7841 &	0.9591 &	0.8149 &	0.9244 &	0.7499 &	0.9383 &	0.8262 \\
% \hline
% \hline
% \textbf{RAG} &	\textbf{LAG} &	\textbf{Gallbladder} &	\textbf{Esophagus} &	\textbf{Stomach} &	\textbf{Duodenum} &	\textbf{LK} \\
% \hline
% 0.6456 &	0.601 &	0.6877 &	0.7986 &	0.8446 &	0.5808 &	0.8219   \\

% \hline
% \end{tabular}
% \label{table:submission}
% \vspace{-2mm}
% \end{table}

\begin{table}[!h]
\centering
\caption{Per-organ DSC and NSD (mean $\pm$ standard deviation) for our final submission.}
\footnotesize
\begin{tabular}{ | c | c | c | }
\hline
\textbf{Classes/Metrics} &  \textbf{DSC} &  \textbf{NSD} \\
\hline
\textbf{Liver} & $0.974\pm0.036$ & $0.963\pm0.064$ \\
\textbf{Right Kidney (RK)} & $0.883\pm0.233$ & $0.869\pm0.241$ \\
\textbf{Spleen} & $0.949\pm0.115$ & $0.936\pm0.134$ \\
\textbf{Pancreas} & $0.772\pm0.147$ & $0.877\pm0.146$ \\
\textbf{Aorta} & $0.960\pm0.045$ & $0.977\pm0.059$ \\
\textbf{Inferior Vena Cava (IVC)} & $0.860\pm0.129$ & $0.861\pm0.143$ \\
\textbf{Right Adrenal Gland (RAG)} & $0.735\pm0.138$ & $0.855\pm0.144$ \\
\textbf{Left Adrenal Gland (LAG)} & $0.690\pm0.172$ & $0.816\pm0.202$ \\
\textbf{Gallbladder} & $0.750\pm0.313$ & $0.733\pm0.329$ \\
\textbf{Esophagus} & $0.783\pm0.147$ & $0.882\pm0.143$ \\
\textbf{Stomach} & $0.860\pm0.113$ & $0.845\pm0.143$ \\
\textbf{Duodenum} & $0.609\pm0.203$ & $0.791\pm0.215$ \\
\textbf{Left Kidney (LK)} & $0.877\pm0.220$ & $0.863\pm0.231$ \\
\hline
\hline
\textbf{Mean} & $0.8233$ & $0.8668$\\ 
\hline
\end{tabular}
\label{table:submission}
\vspace{-2mm}
\end{table}
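
For reference, the DSC in Table \ref{table:submission} is assumed to follow the standard definition for a predicted mask $P$ and a ground-truth mask $G$ (the symbols $P$, $G$, and $\mathcal{C}$ are introduced here only for illustration):
\[
\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} .
\]
The \textbf{Mean} row is consistent with an unweighted average of the per-organ means over the set $\mathcal{C}$ of 13 organs. Using the unrounded per-organ averages, which sum to approximately $10.703$ for DSC and $11.268$ for NSD:
\[
\overline{\mathrm{DSC}} = \frac{1}{13} \sum_{c \in \mathcal{C}} \mathrm{DSC}_c \approx \frac{10.703}{13} \approx 0.8233, \qquad
\overline{\mathrm{NSD}} = \frac{1}{13} \sum_{c \in \mathcal{C}} \mathrm{NSD}_c \approx \frac{11.268}{13} \approx 0.8668 .
\]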


% Name	Liver_DSC	RK_DSC	Spleen_DSC	Pancreas_DSC	Aorta_DSC	IVC_DSC	RAG_DSC	LAG_DSC	Gallbladder_DSC	Esophagus_DSC	Stomach_DSC	Duodenum_DSC	LK_DSC	Liver_NSD	RK_NSD	Spleen_NSD	Pancreas_NSD	Aorta_NSD	IVC_NSD	RAG_NSD	LAG_NSD	Gallbladder_NSD	Esophagus_NSD	Stomach_NSD	Duodenum_NSD	LK_NSD
% AVG	0.973681	0.8830995	0.9494135	0.77222	0.959927	0.859974	0.735023	0.6904665	0.749906	0.783164	0.8598715	0.608984	0.877237	0.963386	0.868547	0.935567	0.877465	0.976524	0.8609845	0.854978	0.8158125	0.7331325	0.881696	0.844934	0.7914345	0.863412
% STD	0.03603279116	0.2333319662	0.1150219921	0.1470408096	0.04512374952	0.128918185	0.1377439032	0.1718526759	0.3129682132	0.1473998664	0.1132866631	0.2032467806	0.2204757629	0.06373675238	0.2410233366	0.1344171005	0.1457098029	0.05946526738	0.142912919	0.1442361113	0.2021023164	0.3286126883	0.1431496632	0.1426850621	0.2150018899	0.231313153

\subsubsection{Qualitative results}


The well-predicted examples in Fig. \ref{fig:qualitative} (1b, 2b, 3b) show segmentation masks with clear and smooth boundaries. Several small organs are also segmented precisely, which suggests that both proposed modules work effectively for organs of various sizes.

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{imgs/qualitative.pdf}
\caption{Qualitative results from the validation set. We illustrate both well-segmented and challenging examples for our proposed segmentation pipeline.}
\label{fig:qualitative}
\end{figure}

% Overall, our model shows high segmentation quality. The highest-scoring cases segment most organs quite well. However, some very small organs, such as the Right Adrenal Gland and Left Adrenal Gland, do not yet achieve high segmentation results. This may be because this region is easily occluded, is small, and the selection of the best slice containing the adrenal gland region for propagation learning is not yet stable. The duodenum has a characteristic C shape and can be occluded by organs such as the pancreas, liver, and colon, which affects our model's segmentation.

On the other hand, our models struggle with difficult cases in which organs are missed. In general, two kinds of cases negatively affect our approach:

\begin{enumerate}
    \item Relatively small organs (adrenal glands (Fig. \ref{fig:qualitative} (1e)), gallbladder (Fig. \ref{fig:qualitative} (1e)), and esophagus (Fig. \ref{fig:qualitative} (3e))) account for the lowest DSC, since the Reference module often fails to identify them.
    \item Other organs (pancreas (Fig. \ref{fig:qualitative} (1e)) and duodenum (Fig. \ref{fig:qualitative} (2e))) are larger, but their extent on the axial plane is short and they are sometimes occluded by surrounding organs. This can affect how information propagates through the slices and cause class confusion in the result.
\end{enumerate}

Furthermore, because our pipeline has two stages, the quality of the second stage depends heavily on the performance of the first. If the reference stage fails to segment an organ, that organ is missed during the entire propagation process. That said, this issue mostly affects organs with a short extent on the axial plane.


\subsubsection{Efficiency results}

Segmentation efficiency results are reported in Table \ref{table:efficiency}. GPU memory and CPU utilization are recorded every 0.1\,s. The area under the GPU memory-time curve and the area under the CPU utilization-time curve are the cumulative values over the running time.

\begin{table}[!h]
\centering
\caption{Efficiency evaluation from the official report.}
\footnotesize
\begin{tabular}{ | c | c | c |}
\hline
\textbf{Running time (s)} & \textbf{AUC GPU} & \textbf{AUC CPU} \\ 
\hline
140.73 & 647605 & 3729  \\
\hline
\end{tabular}
\label{table:efficiency}
\vspace{-2mm}
\end{table}
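
As a rough reading of Table \ref{table:efficiency}, dividing each area by the running time $T$ gives a time-averaged value. This assumes that the areas are time integrals expressed in MB$\cdot$s and \%$\cdot$s, which is not stated in the report:
\[
\frac{\mathrm{AUC}_{\mathrm{GPU}}}{T} = \frac{647605}{140.73} \approx 4602, \qquad
\frac{\mathrm{AUC}_{\mathrm{CPU}}}{T} = \frac{3729}{140.73} \approx 26.5 .
\]
Under that assumption, the pipeline would use roughly 4.6\,GB of GPU memory and about 26.5\% CPU utilization on average.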


% Row 1: Identifying the slices that contain the pancreas region is not yet effective in our model. This may be because model 1 was not trained well enough, or because propagation learning in model 2 on the axial plane is difficult due to the overlapping structures in this region.

% Row 2: In the ground truth, the position and region of the esophagus can be identified. However, our model fails to identify the esophagus, giving an esophagus DSC of 0. This may be because the esophagus region in this file is quite small and has heterogeneous tissue, making it hard to detect already in model 1.

% Inferior Vena Cava: 0.302984938. This organ runs from the chest down to the abdomen and is occluded by many surrounding organs. This may affect the propagation of the segmentation through the slices, causing partial confusion in the segmentation.

% Row 3: The ground truth contains a duodenum segmentation. Our model fails to segment the duodenum entirely, so Duodenum = 0. This may be because our model 1 determined that the duodenum was absent. As a result, learning in model 2 was ineffective.



\subsubsection{Limitation and future work}
\label{sec:limitation}

Although our proposed method has yet to achieve high results, we believe it can be further improved once the limitations identified here are addressed.
First, the dataset becomes imbalanced because we treat segmentation as a 2D problem. After the volumes are split into slices, small organs (such as the pancreas, gallbladder, or adrenal glands) appear in only a small number of slices, while larger organs appear across a wider range. We tried several remedies, for instance smart sampling and imbalance-aware losses, but observed only slight improvements.
Second, because the proposed approach is a two-stage method, the second stage depends on the first. Any organ missed by the Reference module cannot be recovered in the Propagation phase. Thus, the Reference module deserves more attention.
In future work, we encourage focusing on improving the Reference module by fully exploiting temporal information.