
\section{Experiments and Results}


\subsection{Dataset}

This study involves two datasets from two different institutes and scanners, both comprising patients who underwent intensity-modulated radiotherapy for prostate cancer. The first dataset is from Haukeland Medical Center (HMC), Norway. It contains 18 patients with 8--11 CT scans each, where each scan corresponds to a treatment fraction. These scans were acquired using a GE scanner, and have 90 to 180 slices with a voxel size of approximately 0.9 $\times$ 0.9 $\times$ 2.0 mm. The second dataset is from Erasmus Medical Center (EMC), The Netherlands. It consists of 14 patients with 3 follow-up CT scans each. These scans were acquired using a Siemens scanner, and have 91 to 218 slices with a voxel size of approximately 0.9 $\times$ 0.9 $\times$ 1.5 mm. The target structures (prostate and seminal vesicles) as well as the organs-at-risk (bladder and rectum) were manually delineated by radiation oncologists. The networks were trained and validated on the HMC dataset, while the EMC dataset was used as an independent test set. Training was performed on a subset of 111 image pairs from 12 patients, and validation was carried out on the remaining 50 image pairs from 6 patients. All scans were resampled to an isotropic voxel size of 1 $\times$ 1 $\times$ 1 mm.
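
As a rough illustration of the effect of resampling on the image size, the number of slices scales with the ratio of the original to the new slice spacing. This is only a sketch: it assumes that the field of view is preserved, it ignores the rounding convention of the resampling tool (not specified here), and the symbols $N_z$ (number of slices) and $s_z$ (slice spacing) are introduced for illustration only.
\[
N_z' \approx N_z \, \frac{s_z}{s_z'} , \qquad \mathrm{e.g.} \quad 90 \times \frac{2.0~\mathrm{mm}}{1.0~\mathrm{mm}} = 180 , \qquad 218 \times \frac{1.5~\mathrm{mm}}{1.0~\mathrm{mm}} = 327 .
\]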


\subsection{Implementation and Training Details}

We implemented the networks using TensorFlow~\cite{Tensorflow}.
The convolution layers were initialized from a random normal distribution with a mean of 0 and a standard deviation of 0.02, and the trainable alpha parameters of the cross-stitch units were initialized between 0 and 1 from a truncated random normal distribution with a mean of 0.5 and a standard deviation of 0.25.
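One way to write these initializations, in notation assumed here for illustration ($w$ for a convolution weight, $\alpha$ for a cross-stitch parameter, and $\mathcal{N}_{[a,b]}$ for a normal distribution truncated to the interval $[a,b]$), is
\[
w \sim \mathcal{N}(0,\, 0.02^2), \qquad \alpha \sim \mathcal{N}_{[0,1]}(0.5,\, 0.25^2).
\]
The interval $[0,1]$ coincides with $0.5 \pm 2 \times 0.25$, i.e., a truncation at two standard deviations around the mean, which is also the truncation applied by TensorFlow's truncated normal sampler.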
%
The number of filters was set to \{16, 32, 64, 32, 16\} for the cross-stitch network and \{23, 45, 91, 45, 23\} ($\sqrt{2}$ times as many) for the other networks in order to ensure that each network has approximately the same number of trainable parameters, namely $7.8 \cdot 10^5$.
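The factor $\sqrt{2}$ can be motivated by a rough parameter count. As a sketch, assume that the cross-stitch network consists of two parallel paths with the listed filter numbers and that the convolution weights dominate the count; biases, the cross-stitch parameters, and layers with a fixed number of input or output channels are ignored. A convolution layer with kernel volume $K$, $n_{\mathrm{in}}$ input channels and $n_{\mathrm{out}}$ output channels (symbols introduced here for illustration) has $K \, n_{\mathrm{in}} \, n_{\mathrm{out}}$ weights, so that
\[
\underbrace{2 \, K \, n_{\mathrm{in}} \, n_{\mathrm{out}}}_{\mathrm{two~paths}} \;=\; \underbrace{K \, (\sqrt{2}\, n_{\mathrm{in}}) \, (\sqrt{2}\, n_{\mathrm{out}})}_{\mathrm{one~wider~path}} .
\]
Rounding $\sqrt{2} \times \{16, 32, 64\} \approx \{22.6, 45.3, 90.5\}$ gives the filter numbers $\{23, 45, 91\}$.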
%
The patches were sampled equally from the organs-at-risk, the targets, and the remainder of the abdomen. 
We used the RAdam~\cite{RAdam} optimizer with a learning rate of $10^{-4}$.
The networks were trained for 200,000 iterations with an initial batch size of 2. In each batch, the training samples were doubled by switching the roles of the fixed and moving patches, resulting in an effective batch size of 4.
The weights of the Dice and NCC losses were set to 1 and that of the bending energy loss to 0.5. For the total loss, all resolutions are weighted equally, namely $\frac{1}{3}$ each.
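Written out, and with symbols assumed here for illustration ($\mathcal{L}^{(r)}_{\mathrm{Dice}}$, $\mathcal{L}^{(r)}_{\mathrm{NCC}}$ and $\mathcal{L}^{(r)}_{\mathrm{BE}}$ for the Dice, NCC and bending energy losses at resolution $r$), this weighting reads
\[
\mathcal{L} = \sum_{r=1}^{3} \frac{1}{3} \left( \mathcal{L}^{(r)}_{\mathrm{Dice}} + \mathcal{L}^{(r)}_{\mathrm{NCC}} + 0.5 \, \mathcal{L}^{(r)}_{\mathrm{BE}} \right).
\]
This sketch assumes that all three loss terms are active, as in a joint network; which terms enter the loss of each network follows from the corresponding method description.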
Training, validation and testing were performed on an NVIDIA GTX 1080 Ti GPU with 11 GB of memory.

\subsection{Evaluation Measures and Comparative Methods} 

The networks were evaluated in terms of the Mean Surface Distance (MSD) between the predicted and the ground-truth contours. The appendix contains results in terms of the Dice similarity coefficient (DSC) and the 95\% Hausdorff Distance (HD).
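
For reference, one common set of definitions of these measures is sketched below. The exact variants used in the evaluation are not stated in this section, and the symbols are introduced here for illustration only. Let $A$ and $B$ denote the predicted and ground-truth segmentations, $\partial A$ and $\partial B$ their sets of surface points, $d(a, \partial B) = \min_{b \in \partial B} \| a - b \|_2$ the distance from a point to a surface, and $P_{95}$ the 95th percentile. Then
\[
\mathrm{DSC} = \frac{2 \, |A \cap B|}{|A| + |B|} ,
\]
\[
\mathrm{MSD} = \frac{1}{2} \left( \frac{1}{|\partial A|} \sum_{a \in \partial A} d(a, \partial B) + \frac{1}{|\partial B|} \sum_{b \in \partial B} d(b, \partial A) \right) ,
\]
\[
\mathrm{HD}_{95} = \max \left\{ P_{95} \bigl\{ d(a, \partial B) : a \in \partial A \bigr\} ,\; P_{95} \bigl\{ d(b, \partial A) : b \in \partial B \bigr\} \right\} .
\]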

%
We compare the proposed approach to three state-of-the-art methods in abdominal CT radiotherapy: one iterative method, one deep learning method and one hybrid method.

%\begin{itemize}[leftmargin=*]\itemsep-0.3em
\begin{itemize}
  \item \textbf{Elastix}~\cite{Elastix, ElastixKlein}, a conventional iterative registration method. The Mutual Information similarity measure was used since it was found to perform better than the NCC similarity measure on the validation set. The transformation is parameterized by B-splines.

  \item \textbf{JRS-GAN}~\cite{JrsGan}, a deep learning approach that trains a registration network for contour propagation with a joint loss similar to our JRS-registration network, and a discriminator network for giving feedback on the warped images and contours.

  \item \textbf{Hybrid}~\cite{MedPhys}, a hybrid learning and iterative approach. A CNN segments the bladder and feeds the result to the registration model as prior knowledge of the underlying anatomy. The method integrates domain-specific strategies such as gas pocket inpainting, contrast clipping and focused registration for the seminal vesicles and rectum.
\end{itemize}
The inference time is less than a second for the deep learning methods, and on the order of minutes for the iterative and hybrid approaches.

\begin{table}[!t]
	\centering
	\setlength{\tabcolsep}{3pt}
	\caption{MSD (mm) values for the different approaches on the HMC dataset. $\dagger$ denotes a significant difference (at $p=0.05$) between the cross-stitch network and the other networks.
	}
	\resizebox{\textwidth}{!}{
		\begin{tabular}{lrcccccccc} 
			&&\multicolumn{2}{c}{Prostate}&\multicolumn{2}{c}{Seminal vesicles}&\multicolumn{2}{c}{Rectum}& \multicolumn{2}{c}{Bladder} \\ \hline
			& Output Path & $\mu \pm \sigma$ & Median & $\mu \pm \sigma$ & Median & $\mu \pm \sigma$ & Median & $\mu \pm \sigma$ & Median \\ \hline
\multicolumn{2}{l}{ Segmentation }& $1.49 \pm 0.3^{\dagger}$ & 1.49 & $2.50 \pm 2.6^{\dagger}$ & 2.09 & $3.39 \pm 2.2^{\dagger}$ & 2.73 & $1.60 \pm 1.1^{\dagger}$ & 1.13 \\ \hline
\multicolumn{2}{l}{ Registration }& $1.43 \pm 0.8^{\dagger}$ & 1.29 & $1.71 \pm 1.4^{\dagger}$ & 1.37 & $2.44 \pm 1.1^{\dagger}$ & 2.17 & $3.40 \pm 2.3^{\dagger}$ & 2.71 \\ \hline
\multicolumn{2}{l}{ JRS-Registration }& $1.20 \pm 0.4^{\dagger}$ & 1.13 & $1.35 \pm 0.7~$ & 1.16 & $2.08 \pm 1.0^{\dagger}$ & 1.82 & $2.63 \pm 2.3^{\dagger}$ & 1.90 \\ \hline
Fully Hard Sharing & \textit{Segmentation} & $1.14 \pm 0.4^{\dagger}$ & 1.06 & \textcolor{gray}{$1.73 \pm 2.1~$} & \textcolor{gray}{1.12} & $1.91 \pm 0.9~$ & 1.64 & $1.04 \pm 0.7^{\dagger}$ & 0.87 \\
 & \textit{Registration} & \textcolor{gray}{$1.20 \pm 0.3^{\dagger}$} & \textcolor{gray}{1.11} & $1.33 \pm 0.7~$ & \textbf{1.10} & \textcolor{gray}{$2.16 \pm 1.1^{\dagger}$} & \textcolor{gray}{1.85} & \textcolor{gray}{$2.56 \pm 1.9^{\dagger}$} & \textcolor{gray}{1.90} \\ \hline
Cross-Stitch & \textit{Segmentation} & $\textbf{1.06} \pm \textbf{0.3}~$ & \textbf{0.99} & $\textbf{1.27} \pm \textbf{0.4}~$ & \textcolor{gray}{1.15} & $\textbf{1.76} \pm \textbf{0.8}~$ & \textbf{1.47} & $\textbf{0.91} \pm \textbf{0.4}~$ & \textbf{0.82} \\
 & \textit{Registration} & \textcolor{gray}{$1.10 \pm 0.3~$} & \textcolor{gray}{1.06} & \textcolor{gray}{$1.30 \pm 0.6~$} & 1.13 & \textcolor{gray}{$2.00 \pm 1.0~$} & \textcolor{gray}{1.75} & \textcolor{gray}{$2.45 \pm 2.1~$} & \textcolor{gray}{1.81} \\ \hline
\multicolumn{2}{l}{ Elastix~\cite{Elastix} }& $1.73 \pm 0.7^{\dagger}$ & 1.59 & $2.71 \pm 1.6^{\dagger}$ & 2.45 & $3.69 \pm 1.2^{\dagger}$ & 3.50 & $5.26 \pm 2.6^{\dagger}$ & 4.72 \\ \hline
\multicolumn{2}{l}{ JRS-GAN~\cite{JrsGan} }& $1.14 \pm 0.3^{\dagger}$ & 1.04 & $1.75 \pm 1.3^{\dagger}$ & 1.44 & $2.17 \pm 1.1^{\dagger}$ & 1.89 & $2.25 \pm 1.9^{\dagger}$ & 1.54 \\ \hline
\multicolumn{2}{l}{ Hybrid~\cite{MedPhys} }& $1.27 \pm 0.3^{\dagger}$ & 1.25 & $1.47 \pm 0.5^{\dagger}$ & 1.32 & $2.03 \pm 0.6^{\dagger}$ & 1.85 & $1.75 \pm 1.0^{\dagger}$ & 1.26 \\ \hline
		\end{tabular}
	}
	\label{table:Evaluation_HMC_msd} 
\end{table}

\subsection{Evaluation of Architectures on the HMC Dataset}

Quantitative results are given in Table~\ref{table:Evaluation_HMC_msd}, and example results in Figure~\ref{VisualResultCSSegPath}. The first two rows of Table~\ref{table:Evaluation_HMC_msd} show the MSD results of the single-task networks. The registration network performs better than the segmentation network on most organs, as it essentially uses prior knowledge of the patient's organs by warping the manually delineated planning scan.
The segmentation network performed better on the bladder, since the registration network often had trouble establishing a correspondence between the bladder in the fixed image and the moving image as this organ tends to deform considerably between visits.
The segmentation network failed to classify any voxel as seminal vesicles in 5 cases.
The seminal vesicles are hard to identify because of their small size and poor contrast, which explains the relatively poor performance of the segmentation network on this organ. The registration network has the benefit of being able to use the context, namely the surrounding anatomical features and organs, to more accurately warp the seminal vesicles into place.

The results of the loss-joined JRS-registration network are shown in the third row of Table~\ref{table:Evaluation_HMC_msd}. The additional segmentation loss during training improves the registration quality on all four organs, for example reducing the mean MSD from 1.43 to 1.20 mm for the prostate and from 3.40 to 2.63 mm for the bladder.

The fourth and fifth rows of Table~\ref{table:Evaluation_HMC_msd} show the results of the fully hard parameter-sharing network. The contours from its segmentation path are substantially more accurate than those of the single-task segmentation network. The registration path improves over the single-task registration network, but not consistently over the JRS-registration network.
These results demonstrate that architecturally joining segmentation and registration can be very beneficial for the segmentation output and can yield more accurate segmentations than either of the single-task networks.

The cross-stitch network performs best of all networks, as shown in Table~\ref{table:Evaluation_HMC_msd}. Both the segmentation path and the registration path improve over the corresponding paths of the fully hard parameter-sharing network, though it is again the segmentation path that typically yields the most accurate contours.
%
%
The proposed joint networks, particularly the cross-stitch network, yield significantly better contours than any of the state-of-the-art methods. These results confirm the effectiveness of architecturally joining registration and segmentation for generating accurate organ delineations.

\begin{figure}[t]
\centering
% \scalebox{1}[0.95]{
\includegraphics[width=1\textwidth]{Figures/ExampleContours.png}
% }
\caption{Example contours generated by the single-task networks and the cross-stitch network on the HMC dataset. From left to right, the selected cases correspond to the first, second and third quartile in terms of the prostate MSD of the cross-stitch network.
}\label{VisualResultCSSegPath}
\end{figure}

\subsection{Evaluation on the Independent EMC Test Set}

\begin{table}[!tbp]
	\centering
	\setlength{\tabcolsep}{3pt}
	\caption{MSD (mm) values for the different approaches on the independent EMC test set. $\dagger$ denotes a significant difference (at $p=0.05$) between the cross-stitch network and the other networks. Results for JRS-GAN are not available for this dataset.}
	\resizebox{\textwidth}{!}{
		\begin{tabular}{lrcccccccc} 
			&&\multicolumn{2}{c}{Prostate}&\multicolumn{2}{c}{Seminal vesicles}&\multicolumn{2}{c}{Rectum}& \multicolumn{2}{c}{Bladder} \\ \hline
			& Output Path & $\mu \pm \sigma$ & Median & $\mu \pm \sigma$ & Median & $\mu \pm \sigma$ & Median & $\mu \pm \sigma$ & Median \\ \hline
\multicolumn{2}{l}{ Segmentation }& $3.18 \pm 1.8^{\dagger}$ & 2.57 & $9.33 \pm 10.1^{\dagger}$ & 5.82 & $5.79 \pm 3.4^{\dagger}$ & 5.18 & $\textbf{1.88} \pm \textbf{1.5}~$ & 1.50 \\ \hline
\multicolumn{2}{l}{ Registration }& $2.01 \pm 2.5^{\dagger}$ & 1.18 & $2.86 \pm 5.2^{\dagger}$ & 1.18 & $2.89 \pm 2.5^{\dagger}$ & 2.23 & $5.98 \pm 4.7^{\dagger}$ & 4.44 \\ \hline
\multicolumn{2}{l}{ JRS-Registration }& $1.96 \pm 2.6^{\dagger}$ & 1.16 & $2.60 \pm 4.9^{\dagger}$ & 1.07 & $2.64 \pm 2.3~$ & 2.14 & $5.15 \pm 4.4^{\dagger}$ & 3.14 \\ \hline
Fully Hard Sharing & \textit{Segmentation} & \textcolor{gray}{$2.02 \pm 2.5^{\dagger}$} & \textcolor{gray}{1.34} & \textcolor{gray}{$6.34 \pm 10.3^{\dagger}$} & \textcolor{gray}{1.98} & \textcolor{gray}{$3.27 \pm 2.9~$} & \textbf{2.10} & $2.66 \pm 2.6^{\dagger}$ & 1.38 \\
 & \textit{Registration} & $2.00 \pm 2.6^{\dagger}$ & 1.20 & $2.66 \pm 5.2^{\dagger}$ & 1.12 & $2.66 \pm 2.2^{\dagger}$ & \textcolor{gray}{2.24} & \textcolor{gray}{$5.09 \pm 4.2^{\dagger}$} & \textcolor{gray}{2.84} \\ \hline
Cross-Stitch & \textit{Segmentation} & \textcolor{gray}{$1.88 \pm 2.2~$} & \textcolor{gray}{1.21} & \textcolor{gray}{$4.73 \pm 8.0~$} & \textcolor{gray}{1.42} & \textcolor{gray}{$3.61 \pm 5.0~$} & \textcolor{gray}{2.18} & $2.45 \pm 2.4~$ & \textbf{1.24} \\
 & \textit{Registration} & $1.82 \pm 2.4~$ & \textbf{1.09} & $2.45 \pm 5.0~$ & \textbf{1.02} & $\textbf{2.57} \pm \textbf{2.3}~$ & \textbf{2.10} & \textcolor{gray}{$4.93 \pm 4.1~$} & \textcolor{gray}{2.69} \\ \hline
\multicolumn{2}{l}{ Elastix~\cite{Elastix} }& $\textbf{1.42} \pm \textbf{0.7}~$ & 1.17 & $2.07 \pm 2.6^{\dagger}$ & 1.24 & $3.20 \pm 1.6^{\dagger}$ & 3.07 & $5.30 \pm 5.1^{\dagger}$ & 3.27 \\ \hline
\multicolumn{2}{l}{ Hybrid~\cite{MedPhys} }& $1.55 \pm 0.6^{\dagger}$ & 1.36 & $\textbf{1.65} \pm \textbf{1.3}~$ & 1.22 & $2.65 \pm 1.6~$ & 2.36 & $3.81 \pm 3.6^{\dagger}$ & 2.26 \\ \hline
		\end{tabular}
	}
	\label{table:Evaluation_EMC_msd} 
\end{table}
Table~\ref{table:Evaluation_EMC_msd} provides quantitative results on the independent test set. 
The segmentation network failed to classify any voxel as seminal vesicles in 5 cases, and the segmentation paths of the fully hard sharing network and the cross-stitch network in 1 case. 
Note that the deep learning approaches were neither retrained nor fine-tuned on this dataset. Again, the joint networks outperform the single-task networks as well as the state-of-the-art methods in terms of the median values, which are less influenced by outliers. The mean values are relatively high compared to the median values; for the seminal vesicles, for example, the segmentation path of the cross-stitch network yields a mean MSD of 4.73 mm against a median of 1.42 mm. This discrepancy can be explained by intensity variations between the training and test populations, which cause more outliers.
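
As a purely hypothetical illustration of this effect (the numbers below are invented for the example and are not taken from the results), a single outlier among five MSD values shifts the mean far more than the median:
\[
\{1.0,\, 1.1,\, 1.2,\, 1.3,\, 10.0\}~\mathrm{mm}: \qquad \mu = \frac{14.6}{5} = 2.92~\mathrm{mm}, \qquad \mathrm{median} = 1.2~\mathrm{mm}.
\]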


