\begin{figure}[t]
    \centering
    \includegraphics[width=0.85\textwidth]{figures/demonstrate.png}
    \caption{(a) Approximation set. (b) and (h): A transverse slice from the target and source image, respectively. (c)-(g) top row: Warped source images corresponding to five solutions in the set (highlighted with matching frame colors), with bladder and rectum contours in cyan and magenta, respectively. Solid contours represent the contours in the target image and dashed contours represent the contours in the warped source image. (c)-(g) bottom row: DVFs overlaid on the source image. Displacement in the x-y plane is represented by the direction and scale of the arrows, and displacement in the z-direction by their color (red for cranial and blue for caudal motion).}
    \label{fig:demonstrate}
    \vspace{-5mm}
\end{figure}

We implemented\footnote{The implementation is available at \href{https://github.com/monikagrewal/DL-MODIR/tree/public}{https://github.com/monikagrewal/DL-MODIR/tree/public}.} our proposed approach using Python and PyTorch. The training hyperparameters were: number of solutions $p = 27$, initialization = Kaiming He, optimizer = Adam, learning rate (lr) = $10^{-4}$, number of training iterations = 20K, and reference point for HV calculation = $(1, 1, 1)$ (details in Appendix \ref{appendix: reference point}). For each experimental setting, we trained 5 models, each corresponding to a different data split. We report their performance on the test set without model selection. To assess the DIR performance, we calculated the target registration errors (TREs) of the 23 manually annotated landmarks by transforming the landmarks in the target image with the predicted DVF and calculating the Euclidean distance to the corresponding landmarks in the source image. We also calculated the percentage of voxels with a negative determinant of the spatial Jacobian of the DVF as an indication of folding in the transformation.
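
The two evaluation metrics can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the implementation: $u$ denotes the predicted DVF defined on the target image domain $\Omega$, $x_i^{T}$ and $x_i^{S}$ denote the $i$-th landmark in the target and source image, respectively, and $I$ is the identity matrix. We further assume that the determinant is taken of the Jacobian of the full transformation $x + u(x)$, which is a common convention.
\[
\mathrm{TRE}_i = \left\| x_i^{T} + u(x_i^{T}) - x_i^{S} \right\|_2,
\qquad
\mathrm{Folding} = \frac{100}{|\Omega|} \sum_{x \in \Omega} \mathbf{1}\left[ \det\left( I + \nabla u(x) \right) < 0 \right],
\]
where $\mathbf{1}[\cdot]$ equals 1 if the condition holds and 0 otherwise, and $|\Omega|$ is the number of voxels. A negative determinant at a voxel indicates that the transformation locally inverts orientation, which is why it is used as an indicator of folding.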

\subsection{Comparison of MO DIR with Single DIR Output}
Contrary to traditional DIR, in MO DIR, the decision maker (in our case a clinical expert) is provided with multiple DIR solutions spread across a range of trade-offs between conflicting objectives. This is demonstrated in Figure \ref{fig:demonstrate} (a), which shows that there are multiple possible ways to align the two images. In DIR, the solutions at the extremes of the approximation set are likely not interesting because they might be overfitted to a single objective and may consequently yield sub-optimal performance in the other objectives. For example, the solution highlighted in red (Output 1) corresponds to the minimum $L_{ImageSimilarity}$ but the maximum $L_{DVFSmoothness}$, causing a lot of folding in the DVF. Similarly, the solution highlighted in brown (Output 5) corresponds to no deformation at all. To assist a posteriori decision-making, such uninteresting solutions can be filtered out by setting acceptance thresholds on each objective. The region of interest in the objective (loss) space where all the acceptance criteria are met can be considered the preferred region. In Figure \ref{fig:demonstrate} (a), we show this region with arbitrarily selected acceptance thresholds ($L_{ImageSimilarity} < 0.55$, $L_{DVFSmoothness} < 0.1$, and $L_{SegSimilarity} < 0.025$).
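
The threshold-based filtering described above can be sketched in a few lines of Python. This is a minimal illustration and not part of our released implementation: the function names, the dictionary keys, and the loss values of the three example solutions are hypothetical, and only the three thresholds are taken from the text.

```python
# Acceptance thresholds from the example in the text (arbitrarily selected).
THRESHOLDS = {
    "image_similarity": 0.55,
    "dvf_smoothness": 0.1,
    "seg_similarity": 0.025,
}


def in_preferred_region(losses, thresholds=THRESHOLDS):
    """Return True if every loss is strictly below its acceptance threshold."""
    return all(losses[name] < limit for name, limit in thresholds.items())


def filter_solutions(solutions, thresholds=THRESHOLDS):
    """Return the indices of the solutions that lie in the preferred region."""
    return [
        index
        for index, losses in enumerate(solutions)
        if in_preferred_region(losses, thresholds)
    ]


# Hypothetical loss values for three solutions of one scan pair.
example_solutions = [
    # Good image similarity, but too little smoothness (much folding).
    {"image_similarity": 0.30, "dvf_smoothness": 0.40, "seg_similarity": 0.010},
    # Meets all three acceptance criteria.
    {"image_similarity": 0.50, "dvf_smoothness": 0.05, "seg_similarity": 0.020},
    # Almost no deformation: smooth, but poor image and contour alignment.
    {"image_similarity": 0.90, "dvf_smoothness": 0.00, "seg_similarity": 0.030},
]

preferred = filter_solutions(example_solutions)
print(preferred)
```

Only the solutions that pass this filter would need to be inspected visually by the clinical expert.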
%\footnote{The acceptance thresholds are application specific e.g., in radiation treatment planning these are decided by the minimum and maximum dose requirements for an organ \cite{van2020bi}.}.

Within a preferred region of interest, one solution cannot be selected over another based on a quantitative comparison of performance metrics alone, as demonstrated in Figure \ref{fig:demonstrate}.
The solution highlighted in green (Output 2) has the minimum folding in the DVF, the one in magenta (Output 3) has the minimum mean TRE of landmarks, and the one in blue (Output 4) has the maximum Dice similarity between organ masks, while in each case the other metrics are worse. Although Output 2 and Output 3 have less folding in the DVF and a smaller mean TRE between landmarks, their warped bladder contours (dashed cyan) deviate considerably more from the target bladder contours (solid cyan) than those of Output 4. This is due to MO training of the DIR neural network, which ensures that the obtained DIR solutions are all (close to) Pareto optimal, i.e., no solution is better than another in any objective without a simultaneous detriment in another objective. In such a scenario, the most appropriate DIR output can only be selected after visual inspection of the DIR outputs in the preferred region of interest and after considering other clinical criteria. For example, visual inspection of the DVF from Output 4 may reveal that the folding occurs in regions not relevant for brachytherapy treatment. Further, the alignment of the bladder may be more important than the alignment of some landmarks in other regions. Therefore, a clinical expert may prefer Output 4 over Output 3 in this test scan pair despite a larger mismatch between landmarks and more folding in the DVF. In another test scan pair, however, the characteristics of the DVF may be different and the clinical preference may be reversed. Moreover, it is already known from previous research that both the weights that translate to a given trade-off between objectives on the approximation front and the quantitative values of the performance metrics differ between scan pairs \cite{pirpinia2017feasibility}. This means that the preferred region of interest corresponds to different solutions in the approximation sets from different scan pairs.
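
The notion of Pareto optimality used above can be made concrete with a small dominance check. The following is a minimal sketch for minimization objectives, with hypothetical function names and hypothetical loss triples; it is not the routine used in our implementation.

```python
def dominates(a, b):
    """Return True if a dominates b when all objectives are minimized.

    a dominates b if a is no worse than b in every objective and strictly
    better in at least one objective.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (image similarity, DVF smoothness, segmentation) loss triples.
example_points = [
    (0.40, 0.08, 0.020),
    (0.50, 0.05, 0.022),
    (0.45, 0.06, 0.015),
    (0.55, 0.09, 0.024),  # worse than the first point in all three objectives
]

front = non_dominated(example_points)
print(front)
```

In this example the first three points are mutually non-dominated: each is better than the others in at least one objective and worse in another, so no purely quantitative criterion ranks them.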
% Further, it is already known from earlier research that the optimal weights of different objectives corresponding to best quantitative DIR performance are different between different test scans [REF]. This means in another test scan pair, Output 2 may exhibit minimum TRE or Output 3 may exhibit minimum folding in the DVF. Our results (shown in Appendix REF) corroborate these findings.

Because multiple solutions are provided with MO DIR that are spread in objective space, the clinical expert can navigate through these solutions and select an appropriate trade-off based on the underlying clinical scenario. In contrast, with traditional single DIR, only one of these solutions is provided to the clinical expert. Therefore, the opportunity to evaluate other possibilities and make an informed decision specifically tuned to each patient is lost.

\subsubsection{Comparison of Computational Overhead}
In the case of single DIR, a DIR network is trained multiple times with different weight combinations for each loss function following a certain strategy. The weights yielding the best aggregated performance on a validation set are used for final training. In MO DIR, multiple neural networks (in our case a single DIR network with multiple decoders) are trained. Therefore, the training overhead of MO DIR in terms of runtime is similar to that of single DIR. However, in MO DIR, the training is done in parallel, requiring more memory. In our implementation, training for $p = 27$ required $\sim$39 GB and $\sim$32 GB without and with a shared encoder, respectively, as compared to $\sim$3.5 GB required for training a single DIR network.

% \begin{table}[]
% \centering
% \caption{Minimum mean TRE of 23 anatomical landmarks in mm, associated folding in \% for grid search and MO DIR, respectively for each test scan pair. Mean $\pm$ standard deviation from 5 models from 5-fold cross-validation is reported.}
% \begin{tabular}{@{}lc|cc|cc@{}}
% \toprule
% \multirow{2}{*}{Test scan}    & \multirow{2}{*}{TRE before} & \multicolumn{2}{c}{\textbf{Grid Search}} & \multicolumn{2}{c}{\textbf{MO DIR}} \\ \cmidrule(l){3-6} 
%      && TRE                  & \% folding                       & TRE                   & \% folding                       \\ \midrule
% 1 & 3.97 & 3.68 $\pm$ 0.07 & 0.21 $\pm$ 0.16  & 3.75 $\pm$ 0.03, & 0.16 $\pm$ 0.25\\
% 2 & 4.71 & 4.44 $\pm$ 0.09 & 2.86 $\pm$ 0.50  & 4.59 $\pm$ 0.09, & 2.09 $\pm$ 0.35\\ 
% 3 & 8.21 & 8.04 $\pm$ 0.09 & 2.81 $\pm$ 1.27  & 8.10 $\pm$ 0.06, & 1.54 $\pm$ 1.81\\ 
% 4 & 9.07 & 8.17 $\pm$ 0.11 & 0.17 $\pm$ 0.14  & 8.56 $\pm$ 0.08, & 0.55 $\pm$ 0.40\\ 
% 5 & 4.46 & 4.03 $\pm$ 0.08 & 0.94 $\pm$ 0.67  & 4.09 $\pm$ 0.05, & 0.83 $\pm$ 0.67\\ 
% 6 & 5.55 & 4.56 $\pm$ 0.08 & 1.31 $\pm$ 0.16  & 4.77 $\pm$ 0.15, & 0.69 $\pm$ 0.44\\ 
% 7 & 5.99 & 5.85 $\pm$ 0.04 & 0.20 $\pm$ 0.20  & 5.90 $\pm$ 0.04, & 0.13 $\pm$ 0.13\\ 
% 8 & 4.39 & 4.02 $\pm$ 0.07 & 2.09 $\pm$ 1.29  & 4.07 $\pm$ 0.05, & 1.12 $\pm$ 0.88\\ 
% 9 & 5.73 & 5.12 $\pm$ 0.12 & 0.64 $\pm$ 0.18  & 5.24 $\pm$ 0.10, & 1.09 $\pm$ 1.12\\ 
% 10 & 3.80 & 3.73 $\pm$ 0.03 & 0.10 $\pm$ 0.14  & 3.67 $\pm$ 0.02, & 0.44 $\pm$ 0.52\\ \bottomrule
% \end{tabular}
% \label{tab:descriptives}
% \end{table}

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{figures/ls_vs_hv.png}
    \caption{Approximation sets obtained for the first four test scan pairs by linear scalarization (red circles) and the proposed HV-based MO DIR (blue triangles). The approximation sets from five different models trained with different training data splits are shown with slight variations in the color saturation to give an indication of model variance.}
    \label{fig:LS vs. HV}
    \vspace{-5mm}
\end{figure}

\subsection{Comparison of Proposed MO DIR with Linear Scalarization}
\label{experiment: ls vs hv}
In the proposed MO DIR, we used HV maximization to dynamically find the weights for each loss term such that training the different neural network heads with differently weighted losses yields outputs that are diversely spread across the approximation front. It may be speculated that a similar diversity of outputs can be trivially obtained by training the different neural networks with uniformly distributed weights for the different losses. Such an approach is called `linear scalarization'. \citet{deist2023multi} compared linear scalarization with HV maximization for different shapes of Pareto fronts. The authors observed that the translation of the weights to a location on the front depends on the shape of the Pareto front and is, as such, non-trivial. To investigate this in the case of MO DIR, we compared the proposed HV maximization based MO DIR approach with linear scalarization based MO DIR. To simulate the MO DIR setup with linear scalarization, we trained the different heads of our MO DIR neural network with weights corresponding to diversely distributed points in a grid. We used 27 grid points, obtained by enumerating all possible combinations of $w_1 \in \{0, 0.5, 1\}$, $w_2 \in \{0, 0.1, 0.5, 1\}$, and $w_3 \in \{0, 0.5, 1\}$ and omitting redundant combinations (e.g., $(0, 0.5, 0.5)$ and $(0.5, 0.5, 0.5)$). It should be noted that this process of selecting linear scalarization weights is already slightly better than naive linear scalarization.
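
The grid construction can be sketched as follows. The sketch rests on an assumption about what `redundant' means: a weight combination is treated as redundant if it is all zero or if it is a scalar multiple of another combination in the grid, since scaling all weights by the same factor does not change the trade-off. Under this assumption the enumeration leaves 27 combinations. The function name is hypothetical and exact fractions are used only to avoid floating-point comparisons.

```python
from fractions import Fraction
from itertools import product

W1 = [Fraction(0), Fraction(1, 2), Fraction(1)]
W2 = [Fraction(0), Fraction(1, 10), Fraction(1, 2), Fraction(1)]
W3 = [Fraction(0), Fraction(1, 2), Fraction(1)]


def unique_weight_directions(w1_values, w2_values, w3_values):
    """Enumerate the grid and keep one representative per weight direction.

    Each non-zero combination is rescaled so that its largest weight is 1.
    Combinations that are scalar multiples of each other then coincide and
    are kept only once. The all-zero combination is skipped.
    """
    directions = set()
    for weights in product(w1_values, w2_values, w3_values):
        largest = max(weights)
        if largest == 0:
            continue  # all-zero weights do not define a loss
        directions.add(tuple(w / largest for w in weights))
    return sorted(directions)


weights = unique_weight_directions(W1, W2, W3)
print(len(weights))  # 27 under the stated assumption
```

For instance, $(0, 0.5, 0.5)$ and $(0, 1, 1)$ are both mapped to $(0, 1, 1)$, and $(0.5, 0.5, 0.5)$ and $(1, 1, 1)$ are both mapped to $(1, 1, 1)$, so only one of each pair is retained.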

The approximation sets obtained from linear scalarization vs. HV maximization based MO DIR are shown in Figure \ref{fig:LS vs. HV}. It is apparent upon visual inspection of the figure that, even though the weights used for linear scalarization were diversely distributed, the obtained solutions are clustered along two edges of the expected triangle-like approximation front, leaving a void of solutions in its center region. This observation corroborates, in the case of DIR as well, the results in \cite{deist2023multi}: a diverse spread of solutions across the approximation front cannot be obtained trivially through linear scalarization. In contrast, the solutions in the approximation set obtained using HV maximization based MO DIR show a rough triangle-like shape with diversely distributed points in the center as well. This is because HV maximization ensures not only proximity to the Pareto front but also diversity across the approximation front.
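
To illustrate why HV rewards diversity, the HV of a small set of loss triples with respect to the reference point $(1, 1, 1)$ can be computed by inclusion-exclusion over the boxes spanned by each point and the reference point. The sketch below is for illustration only: it is not the HV routine of our implementation, its cost grows exponentially with the number of points, and the function name and example points are hypothetical.

```python
from itertools import combinations


def hypervolume(points, reference=(1.0, 1.0, 1.0)):
    """Hypervolume dominated by `points` (minimization) up to `reference`.

    Each point p spans the box [p, reference]. The volume of the union of
    these boxes follows from inclusion-exclusion, because the intersection
    of several boxes is the box spanned by their coordinate-wise maximum.
    """
    total = 0.0
    for size in range(1, len(points) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(points, size):
            volume = 1.0
            for j, ref_j in enumerate(reference):
                worst = max(p[j] for p in subset)
                # Points beyond the reference point contribute no volume.
                volume *= max(0.0, ref_j - worst)
            total += sign * volume
    return total


# Two identical solutions versus two solutions with different trade-offs.
clustered = [(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)]
spread = [(0.5, 0.5, 0.5), (0.25, 0.75, 0.75)]

hv_clustered = hypervolume(clustered)
hv_spread = hypervolume(spread)
print(hv_clustered, hv_spread)
```

A duplicated solution adds nothing to the HV, whereas a second solution with a different trade-off enlarges it, which is the mechanism by which HV maximization pushes solutions apart along the front.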

% In our implementation, training for $p = 27$ required $\sim$ 39335MiB, 32715MiB with original MO and shared encoder, respectively as compared to 3581MiB required for training a single DIR network. 