\section{Methods}

\subsection{Base Network Architecture}

We use a 3D deep convolutional neural network derived from the U-Net~\cite{Unet} and inspired by~\cite{Birnet} as a base architecture for all networks presented in this paper.
The network uses $3\times3\times3$ convolution layers without padding.  
LeakyReLU and batch normalization are applied after each convolution layer.
Strided convolutions are used for downsampling in the contracting path, and upsampling layers are used in the expansive path.
The network has three output resolutions and is deeply supervised at each resolution.
At the lower resolutions the network can focus on large organs and large deformations, whereas at the higher resolutions it can focus on small organs and fine deformations.
Given an input patch of size $n^3$, the sizes of the high, middle, and low resolution patches are $(n-40)^3$, $(\frac{n}{2}-18)^3$, and $(\frac{n}{4}-7)^3$, respectively, where $n = 96$ is used in this paper. A detailed schematic is given in the appendix.
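As a quick check of these expressions, substituting $n = 96$ gives
\[
(96-40)^3 = 56^3, \qquad \left(\frac{96}{2}-18\right)^3 = 30^3, \qquad \left(\frac{96}{4}-7\right)^3 = 17^3,
\]
so the three patches have edge lengths of 56, 30, and 17 voxels, respectively.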



\begin{figure}[tb]
\centering
\scalebox{1}[0.9]{
\includegraphics[width=0.85\textwidth]{Figures/FullyHardSharingNetwork.png}
}
\caption{The inputs, architecture, outputs, and losses of the fully hard parameter sharing network. Here, S stands for a segmentation layer, R for a registration layer, and S+R for a shared layer. Only the highest-resolution $1\times1\times1$ convolution layers and outputs are shown for clarity.
}\label{ArchitectureFHard} 
\end{figure}

\subsection{Single-Task Segmentation and Registration Networks} 

Single-task segmentation and registration networks were trained to serve as baselines for the proposed joint networks. These networks have identical architectures except for their input and output layers.
%As illustrated in Figure~\ref{ArchitectureSeg}
The segmentation network takes the daily CT scan as input, which we refer to as the fixed image \IF, and predicts the corresponding segmentation \SPred.
%
% \begin{figure}[ht]
% \centering
% \scalebox{1}[0.9]{
% \includegraphics[width=0.9\textwidth]{Figures/NetworkSeg.png}
% }
% \caption{The input, architecture, output and losses of the separate segmentation network. The underlined layer is the fully convolutional output layer. The lower level fully convolutional layers and their outputs are omitted for clarity.}\label{ArchitectureSeg}
% \end{figure}
%
The segmentation network is trained using the Dice Similarity Coefficient (DSC) loss, which quantifies the overlap between \SPred and the ground truth segmentation \SF.
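As an illustration, a commonly used soft formulation of this loss for a single structure (an assumption here, since the exact variant is not specified in this section) is
\[
\mathcal{L}_{\mathrm{DSC}} = 1 - \frac{2 \sum_{i} p_i \, g_i}{\sum_{i} p_i + \sum_{i} g_i},
\]
where $p_i$ and $g_i$ are illustrative symbols for the predicted probability and the ground truth label of voxel $i$.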
% \subsection{Registration Network and and Losses}
The registration network takes both the planning scan, which we refer to as the moving image \IM, and the daily scan \IF as input and establishes the correspondence between the two images in the form of a Deformation Vector Field (DVF, $\phi^{pred}$). 
For this purpose, it is crucial that corresponding anatomical features in the two scans fit inside the network's field of view; the images are therefore affinely aligned beforehand.
The predicted DVF $\phi^{pred}$ is then used to warp \IM such that, ideally, the warped moving image \IWarp is identical to \IF.
The registration network is trained using the Normalized Cross-Correlation (NCC) loss that quantifies the dissimilarity between \IWarp and \IF, and the bending energy loss as a regularization term to encourage smoothness of $\phi^{pred}$. %/of the deformation vector field.
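For concreteness, one common way to write these two terms is sketched here; the exact variants are not specified in this section, so the forms and the symbols $a_i$, $b_i$, and $\Omega$ are assumptions made for illustration.
With $a_i$ and $b_i$ the intensities of the warped moving image and the fixed image at voxel $i$, and $\bar{a}$ and $\bar{b}$ their means over the patch,
\[
\mathcal{L}_{\mathrm{NCC}} = 1 - \frac{\sum_{i} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i} (a_i - \bar{a})^2}\,\sqrt{\sum_{i} (b_i - \bar{b})^2}},
\]
which is zero when the intensities of the two patches are related by a linear map with positive slope.
The bending energy of a three-dimensional DVF can be written as
\[
\mathcal{L}_{\mathrm{BE}} = \frac{1}{|\Omega|} \sum_{\mathbf{x} \in \Omega} \sum_{r=1}^{3} \sum_{s=1}^{3} \left\| \frac{\partial^2 \phi^{pred}(\mathbf{x})}{\partial x_r \, \partial x_s} \right\|^2,
\]
where $\Omega$ is the set of voxels in the patch and $x_r$, $x_s$ are the spatial coordinates; this term vanishes for affine deformations and grows with the local curvature of the field.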


% \begin{figure}[ht]
% \centering
% \scalebox{1}[0.9]{
% \includegraphics[width=0.9\textwidth]{Figures/NetworkReg.png}
% }
% \caption{The input, architecture, output and losses of the separate registration network.}\label{ArchitectureReg}
% \end{figure}


\subsection{Joining Registration and Segmentation via the Loss}

Similar to previous work~\cite{JrsGan}, the network in this approach joins registration and segmentation through the loss function. The network is similar to the registration network described in the previous section, except that it also takes the moving segmentation \SM as input and is trained jointly using a Dice loss in addition to the NCC and bending energy losses. This Dice loss penalizes discrepancies between the fixed ground truth segmentation \SF and the warped moving segmentation \SWarp. We refer to this network as the JRS-registration network.
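As a sketch, the resulting objective can be written as a weighted sum of the three terms,
\[
\mathcal{L} = \mathcal{L}_{\mathrm{NCC}} + w_{1}\, \mathcal{L}_{\mathrm{BE}} + w_{2}\, \mathcal{L}_{\mathrm{DSC}},
\]
where $\mathcal{L}_{\mathrm{NCC}}$ compares \IWarp with \IF, $\mathcal{L}_{\mathrm{BE}}$ regularizes $\phi^{pred}$, and $\mathcal{L}_{\mathrm{DSC}}$ compares \SWarp with \SF. The weights $w_{1}$ and $w_{2}$ are hypothetical placeholders; their values, and whether any weighting is used at all, are not stated in this section.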

% \begin{figure}[ht]
% \centering
% \scalebox{1}[0.9]{
% \includegraphics[width=0.9\textwidth]{Figures/NetworkJRSReg.png}
% }
% \caption{The input, architecture, output and losses of the JRS-registration network.}\label{ArchitectureJRSReg} 
% \end{figure}

\begin{figure}[ht]
\centering
\scalebox{1}[0.9]{
\includegraphics[width=0.9\textwidth]{Figures/CrossStitchNetwork.png}
}
\caption{The inputs, architecture, outputs, and losses of the cross-stitch network.}\label{ArchitectureCS} 
\end{figure}

\subsection{Joint Registration and Segmentation using Hard Parameter Sharing}

In this joint network (Figure~\ref{ArchitectureFHard}), the registration and segmentation sub-networks share all their parameters, except for the task-specific $1\times1\times1$ convolution layers.
Apart from these two layers, the network is architecturally similar to the single-task networks.
The network is trained with the Dice loss on the segmentation output (similar to the segmentation network), and the NCC, bending energy, and Dice losses on the registration output (similar to the JRS-registration network).
%
%
%
Since the network predicts two segmentation maps, one for each path, the contours from one path can be discarded. A simple strategy is to keep the contours from the path that performed best on the validation set. The segmentations can also be selected on a per-organ basis. 
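One way to formalize the per-organ selection, assuming that it is also based on validation performance measured by the DSC, is
\[
p^{*}(o) = \arg\max_{p \in \{S, R\}} \overline{\mathrm{DSC}}_{\mathrm{val}}(p, o),
\]
where $o$ indexes the organs, $p$ denotes the segmentation ($S$) or registration ($R$) path, and $\overline{\mathrm{DSC}}_{\mathrm{val}}(p, o)$ is the mean validation DSC of path $p$ for organ $o$; this notation is introduced here for illustration only. At test time, the contours of organ $o$ are then taken from path $p^{*}(o)$.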


\subsection{Joint Registration and Segmentation via a Cross-Stitch Network}

We propose to architecturally join 3D U-Net-like networks for registration and segmentation by connecting the paths using cross-stitch units~\cite{CrossStitch}.
The cross-stitch units linearly combine pairs of feature maps from the segmentation path and the registration path using learnable parameters $\bm{\alpha}$. Given the segmentation path $S$ and the registration path $R$ of the joint network, the feature maps of filter $k$ in layer $\ell \in \{3, 6, 9, 12\}$, denoted $X^{\ell, k}_S$ and $X^{\ell, k}_R$, respectively, are connected to a cross-stitch unit with learnable parameters $\alpha^{\ell,k}_{SS}$, $\alpha^{\ell,k}_{SR}$, $\alpha^{\ell,k}_{RS}$, and $\alpha^{\ell,k}_{RR}$.
This cross-stitch unit computes $X^{\prime\ell, k}_S = \alpha^{\ell,k}_{SS} \cdot X^{\ell, k}_S + \alpha^{\ell,k}_{SR} \cdot X^{\ell, k}_R$ for the segmentation path and $X^{\prime\ell, k}_R = \alpha^{\ell,k}_{RS} \cdot X^{\ell, k}_S + \alpha^{\ell,k}_{RR} \cdot X^{\ell, k}_R$ for the registration path.
The cross-stitch network has the advantage of being able to learn to strongly share feature maps between the tasks if that is beneficial. Conversely, if it is better for pairs of feature maps to be completely independent, the network can learn the identity matrix to separate those feature maps. This allows representations to be shared between the two paths in a flexible manner, at a negligible cost in terms of number of parameters.
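Equivalently, and purely as a restatement of the two expressions above, the unit applies a $2\times2$ matrix to each pair of feature maps:
\[
\left[ \begin{array}{c} X^{\prime\ell,k}_S \\ X^{\prime\ell,k}_R \end{array} \right]
=
\left[ \begin{array}{cc} \alpha^{\ell,k}_{SS} & \alpha^{\ell,k}_{SR} \\ \alpha^{\ell,k}_{RS} & \alpha^{\ell,k}_{RR} \end{array} \right]
\left[ \begin{array}{c} X^{\ell,k}_S \\ X^{\ell,k}_R \end{array} \right].
\]
Setting this matrix to the identity ($\alpha^{\ell,k}_{SS} = \alpha^{\ell,k}_{RR} = 1$, $\alpha^{\ell,k}_{SR} = \alpha^{\ell,k}_{RS} = 0$) passes both feature maps through unchanged, whereas setting all four entries to $0.5$ feeds the same average of the two maps to both paths. Each unit adds four scalars per filter, so with $K_\ell$ filters in layer $\ell$ (a symbol introduced here for illustration) the cross-stitch units contribute $4 \sum_{\ell} K_\ell$ additional parameters in total.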

We place the cross-stitch units after the downsampling and upsampling layers, i.e., at four positions in total.
This is in line with the original cross-stitch paper, where the authors suggest placing cross-stitch units after every pooling activation map.
We found that the number of units is more crucial than their location, as long as the units are distributed evenly across the network. For example, placing the cross-stitch units before the downsampling and upsampling layers instead of after them does not alter the performance, but placing a large number of cross-stitch units, e.g., after every layer, degrades the performance of the network. The proposed architecture is visualized in Figure~\ref{ArchitectureCS}.


