% A detail description of the method used, a schematic representation of the method is recommended.

% A clear description of preprocessing
% A clear description of the proposed method; importantly, the authors should clarify the
% method to use unlabelled images
% A figure of network architecture
% Strategies to improve inference speed and reduce resource consumption
% A clear description of post-processing


\subsection{Preprocessing}
\label{sec:preproc}

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{imgs/preproc.pdf}
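\caption{CT windowing. Three windowed versions of a slice, highlighting the abdomen, chest, and spine, are stacked into a single three-channel image.}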
\label{fig:preproc}
\end{figure}

% Because the Hounsfield Unit scale has a very wide range, and might not perform well with the approach on conventional visual models, we apply a window based on prior knowledge to display map to 256 range. The motivation is with disease tissue it is only visible when applying windows, if we use raw data, the diseased tissue is mixed with normal tissue, it is almost impossible to extract information from the visual image.

For preprocessing, we apply the windowing technique~\cite{windowingct17yahya} with different levels and widths to target specific groups of organs. Windowing, also known as grey-level mapping, contrast stretching, histogram modification, or contrast enhancement, manipulates the grayscale rendering of a CT image through its CT numbers, changing the appearance of the picture to highlight particular structures. The window level adjusts the brightness of the image, and the window width adjusts its contrast. In our experiments, we create three versions of each slice, highlighting the abdomen, chest, and spine respectively, and stack them into a single three-channel image (Fig.~\ref{fig:preproc}).
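
As a sketch of this mapping, assuming the common linear form of windowing (the exact formula and window settings are not given here, and the symbols below are introduced only for illustration), a voxel with CT number $h$ in Hounsfield units is mapped, for window level $\ell$ and window width $w$, to the display intensity
\begin{align}
        I(h) &= 255 \cdot \min\left(1, \max\left(0, \frac{h - (\ell - w/2)}{w}\right)\right)
\end{align}
so that values below $\ell - w/2$ map to 0, values above $\ell + w/2$ map to 255, and values in between are scaled linearly. Under this assumption, raising $\ell$ darkens the image and narrowing $w$ increases its contrast.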

In addition, we cut the slices from the CT volumes along the axial plane, since the size along this axis varies from volume to volume.
Because some organs are relatively small, we keep the slices at their original size, without any cropping, resampling, or resizing.
Each image is rotated to a predefined orientation and then divided by 255 for normalization before the next step.


\subsection{Proposed Method}
\label{sec:method}

\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{imgs/overview.pdf}
\caption{Overview of the proposed pipeline. First, the entire CT volume is processed with CT windowing to obtain a stack of three-channel slices. The slices then pass through the Reference module, which produces a small number of preliminary masks. Finally, the Propagation module refines these initial masks to produce the final result.}
\label{fig:method}
\end{figure}

Our method consists of two main modules, the Reference module and the Propagation module, as shown in Fig.~\ref{fig:method}.

First, we uniformly select $k$ slices from the CT volume as initial candidates. These slices are processed with the windowing technique (Section~\ref{sec:preproc}) and then passed through the Reference module (Section~\ref{sec:reference}), which performs standard multiclass segmentation and yields $k$ preliminary masks.
With these slice-mask pairs as prior knowledge, the Propagation module (Section~\ref{sec:propagation}) propagates the objects' transformation information to the remaining slices along the length of the CT volume. The final output of this module is a dense 3D mask prediction in which each voxel is assigned a class.


\subsubsection{Reference module}
\label{sec:reference}

This module is expected to propose a minimal set of slices, together with their predicted masks, that carries the most information about the entire CT volume. Fig.~\ref{fig:reference} shows the details of this module.

\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{imgs/reference.pdf}
\caption{The Reference module. The semi-supervised technique CPS is applied in both the training and inference stages to improve prediction accuracy. Selection strategies are then used to choose the slices that are most informative for the next stage.}
\label{fig:reference}
\end{figure}

To exploit the large amount of unlabeled data, we apply Cross Pseudo Supervision (CPS)~\cite{cps21chen}, a recent semi-supervised method that performs effectively on several other datasets (yellow cube in Fig.~\ref{fig:reference}).
CPS makes use of unlabeled data through a dual-student scheme, in which two models are trained simultaneously on labeled data while each generates pseudo labels for its ``peer'' to learn from. At test time, both models predict on the same image, and their outputs are aggregated by summation.
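
As a sketch of this scheme, with notation introduced here only for illustration, let $P_1$ and $P_2$ be the class-probability maps predicted by the two networks for the same image, and let $\hat{Y}_j = \arg\max_{c} P_j$ be the pseudo mask derived from network $j$. Each network is supervised by the pseudo mask of its peer, and, assuming the test-time sum is taken over class probabilities, the final label map is the per-pixel arg max of that sum:
\begin{align}
        \mathcal{L}_{cps} &= \ell(P_1, \hat{Y}_2) + \ell(P_2, \hat{Y}_1) \\
        \hat{Y} &= \arg\max_{c} \left(P_1 + P_2\right)
\end{align}
Here $\ell$ denotes a segmentation loss (Section~\ref{sec:loss}), and the pseudo masks are treated as fixed targets, so no gradient flows through them.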

We adopt two prominent state-of-the-art 2D segmentation models with very different learning paradigms for the CPS framework, namely TransUNet~\cite{transunet21chen} and DeeplabV3+~\cite{dlv3p18chen}. DeeplabV3+, as a convolutional model, focuses more on local information, whereas transformers model long-range relations, so cross training can help to learn a unified segmenter with both properties at the same time~\cite{crosscnntransformer21luo}. In short, we choose TransUNet and DeeplabV3+ because they complement each other.

In addition, we propose a strategy, based on both simple logic and specialist knowledge, to choose which slices are passed on to the Propagation module. The goal is to preserve only the most useful information for the refinement stage.

Concretely, before prediction with the CPS module, a small number of slices are uniformly sampled from the processed CT volume. After CPS produces segmentation masks for these slices, a second selection step keeps only the masks in which the organs occupy the largest areas.

Although we employ a semi-supervised learning technique in the Reference module, the 2D networks still lack information about the position of a slice along the axial axis of the CT volume. We resolve this by embedding each slice's relative position along the axial dimension as a feature vector and feeding it to both networks, TransUNet and DeeplabV3+. The rationale is that, with this additional positional (``temporal'') knowledge, the models can capture the positional constraints on where each organ appears and hence provide better predictions.

\textbf{\textit{Positional Encoding}}. For the model to make use of the order of the slices, we need to inject some information about their positions. For simplicity, we add an embedding layer that embeds the relative position of each slice. Specifically, the embedded position index is concatenated with the hidden features before the final segmentation head. The relative position of the $k^{th}$ slice of CT volume $i$, which contains $T_i$ slices, is calculated as:

\begin{align}
        PE(k) &= \frac{k}{T_i}
\end{align}
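
For example, with illustrative numbers that are not taken from the dataset, the slice with index $k = 30$ in a volume of $T_i = 120$ slices receives
\begin{align}
        PE(30) &= \frac{30}{120} = 0.25
\end{align}
Because the index is divided by the volume length, slices at the same relative height receive the same value regardless of how many slices the volume contains.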

We attach this layer to both DeeplabV3+ and TransUNet. Since both follow the conventional structure of segmentation models, which consists of an encoding and a decoding phase, we attach the layer in the same way for both, as shown in Fig.~\ref{fig:pe_arch}.

\begin{figure}[!h]
    \centering
    \includegraphics[width=\textwidth]{imgs/pe.pdf}
    \caption{A general and simple way to attach a Positional Encoding branch to segmentation models.}
    \label{fig:pe_arch}
\end{figure}

\subsubsection{Propagation module}
\label{sec:propagation}

This module uses the annotated slices provided by the Reference module as prior knowledge to make predictions on the remaining slices. This mechanism is referred to as mask (or label) propagation.

Conventional 2D CNNs cannot capture the information along the third dimension of a CT volume. To capture this ``temporal'' information along the axial axis, we adapt the Space-Time Correspondence Network (STCN)~\cite{stcn21cheng}, a semi-supervised segmentation algorithm that has achieved promising results on video object segmentation, to the 3D setting.

\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{imgs/propagation.pdf}
\caption{The Propagation module. Starting from an annotated CT slice at timestep $T$, STCN spreads its information through the entire defined range $[T-k_1, T+k_2]$.}
\label{fig:propagation}
\end{figure}

STCN uses a memory bank that stores information about previous frames and their corresponding masks and uses it later as prior knowledge. To generate the mask for the current frame, a pairwise affinity matrix between the query frame and the memory frames is computed from the negative squared Euclidean distance and is then used to support the generation of the current mask~\cite{stcn21cheng}.
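
Following the notation of STCN~\cite{stcn21cheng}, and omitting any scaling factor, this computation can be sketched as follows. Let $k^{M}_{i}$ and $k^{Q}_{j}$ be the key features at memory location $i$ and query location $j$, and let $v^{M}_{i}$ be the memory value at location $i$:
\begin{align}
        S_{ij} &= -\lVert k^{M}_{i} - k^{Q}_{j} \rVert_2^2 \\
        W_{ij} &= \frac{\exp(S_{ij})}{\sum_{n} \exp(S_{nj})} \\
        v^{Q}_{j} &= \sum_{i} W_{ij} \, v^{M}_{i}
\end{align}
The affinity $S$ is normalized over the memory locations with a softmax, and the resulting weights $W$ aggregate the memory values into a readout $v^{Q}$, which is decoded into the mask of the query slice.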

We slightly modify the original STCN to match the current problem. The original work uses only a single dense mask to propagate through the entire video, so for the model to work well, the selected mask must contain information about all available classes. To achieve the same goal in our case, we allow multiple masks to be used for propagation, so that together these masks contain enough information about every organ. We also allow STCN to work bidirectionally to enhance the refinement. Fig.~\ref{fig:propagation} illustrates this process.

Notably, STCN can be trained in a binary manner, meaning that each abdominal organ can be learned separately. As a result, knowledge can be transferred well between different organ classes.


\subsubsection{Pseudo labeling with Uncertainty Estimation}
\label{sec:labeling}

Given a vast number of unlabeled CT volumes, we apply an uncertainty estimation technique to make the most of the data.

First, several CPS models are trained on the provided labeled data.
Then, we use these trained CPS models to obtain pseudo masks on the unlabeled set. Inspired by \cite{wang2019active}, we calculate the Dice scores between these pseudo masks and the aggregated one. The mean of these Dice scores is compared with a threshold to determine whether the aggregated pseudo masks are accepted. In other words, a consensus-based assessment is used to evaluate the quality of the pseudo labels.

We compute a single score for the $i^{th}$ volume in the unlabeled set as follows:

\begin{align}
        \text{score}_{i} &= \frac{1}{K^{i} M} \sum_{k=1}^{K^{i}} \sum_{m=1}^{M} \text{DSC}(\mathcal{Y}^{k,i}_{m}, \mathcal{Y}^{k,i}_{AVG}) \\
        \text{DSC}(X, Y)  &= \frac{2 |X \cap Y|}{|X| + |Y|}
\end{align}

DSC denotes the Dice score, which measures the overlap between two masks $X$ and $Y$; it is normally computed between a prediction and the ground truth, whereas here it is computed between a single model's output and the aggregated mask.
Here $\mathcal{Y}^{k,i}_{m}$ denotes the $m^{th}$ model's output for the $k^{th}$ slice of volume $i$, $\mathcal{Y}^{k,i}_{AVG}$ is the mask averaged over all $M$ models for the same slice, and $K^{i}$ is the number of slices of volume $i$. The easier a sample is, the more likely the segmentation models are to produce similar outputs. In contrast, hard samples are more likely to be segmented differently by different models. Hence, we use the proposed score to measure the agreement between the models' predictions. A higher score lends more credibility to the prediction, as it is more consistent.
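
As a worked example with made-up numbers, consider $M = 3$ models and, for brevity, a volume with a single slice ($K^{i} = 1$). If the Dice scores between the three individual masks and the averaged mask are 0.90, 0.80, and 0.85, then
\begin{align}
        \text{score}_{i} &= \frac{0.90 + 0.80 + 0.85}{1 \times 3} = 0.85
\end{align}
With a hypothetical acceptance threshold of 0.8, this volume and its aggregated pseudo mask would be kept; the threshold value here is an assumption for illustration only.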

All aggregated samples with high certainty are then reused in the next supervised training cycle. After training finishes, the same labeling process is repeated until all of the aforementioned models achieve satisfactory performance or all unlabeled data have been used.


\subsubsection{Loss function}
\label{sec:loss}

For the Reference module, we use the prevalent combination of Dice loss and cross-entropy loss with a smoothing value to alleviate the class imbalance affecting small organs, which arises from splitting the volumes into slices. The same setting is used for the supervised branch of CPS, whereas only the Dice loss is used for the unsupervised branch.

For the Propagation module, we combine the online hard example mining cross-entropy loss (OhemCE, also called bootstrapped cross-entropy)~\cite{ohemce16wu} with the Lov\'asz loss~\cite{lovasz18berman}. OhemCE reduces the contribution of the background label to the final loss, and, since STCN is trained on a binary task, it directs the model to focus on visible, difficult objects. The Lov\'asz loss, a surrogate for the intersection-over-union measure, is commonly used in prior research and competitions.
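
As a sketch of the OhemCE term, with notation introduced here only for illustration, let $\ell_x$ be the cross-entropy at pixel $x$ and let $\mathcal{H}_p$ be the set containing the fraction $p$ of pixels with the largest $\ell_x$. The loss averages over these hard pixels only:
\begin{align}
        \mathcal{L}_{OhemCE} &= \frac{1}{|\mathcal{H}_p|} \sum_{x \in \mathcal{H}_p} \ell_x
\end{align}
The fraction $p$ is a hyperparameter whose value is not specified here. Because easy background pixels typically have small $\ell_x$, they rarely enter $\mathcal{H}_p$ and therefore contribute little to the loss.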

\subsection{Post-processing}
% \textit{Description of post-processing of the model outputs to get the final output in training stage.}\\

We do not apply any post-processing, since no complex preprocessing is used and all experiments are conducted on the original-sized image volumes, apart from the change of orientation. The only remaining step is to transform the predicted mask back to the original orientation before submitting it to the evaluation system.

\subsection{Inference Optimization}
We do not apply any engineering techniques to reduce resource consumption or to speed up inference.