\section{Effect of Parameter Sharing in the Encoder}
\label{appendix: parameter sharing}
% We compared the loss curves and obtained approximation fronts on the validation data between a naive implementation of MO DIR (one registration network for each point on the approximation front) and the proposed implementation (shared encoder for all registration networks). Figure \ref{fig:parameter sharing} shows the comparison of the training losses of the neural networks trained using MO training without and with parameter sharing in the encoder. As can be seen that the range of the losses is not affected so much by parameter sharing. Further, the hypervolume of the validation data (which is an indication of the diversity of predicted DVFs along with their proximity to the Pareto front) corroborates that parameter sharing in the encoder causes only a small decrease in the diversity of the predictions.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{figures/parameter_sharing.png}
    \caption{Effect of parameter sharing in the encoder. Filled circles: MO DIR without parameter sharing in the encoder; triangles: MO DIR with parameter sharing in the encoder. $p=5$, $n=2$. Approximation sets obtained from 5 models trained on different data splits are shown. Each color represents a DIR solution corresponding to a specific trade-off between $L_{ImageSimilarity}$ and $L_{DVFSmoothness}$.}
    \label{fig:effect of parameter sharing}
\end{figure}

Figure \ref{fig:effect of parameter sharing} shows, for all test scan pairs, the 5 approximation sets obtained from the 5 models of a 5-fold cross-validation. The MO DIR approach was trained with $p=5$ on the $L_{ImageSimilarity}$ and $L_{DVFSmoothness}$ losses, without (filled circles) and with (triangles) parameter sharing in the encoder. The figure indicates that parameter sharing does not noticeably affect the distribution of solutions on the front.

\newpage
\section{Description of Landmarks}
\label{appendix: landmarks list}

\begin{figure}[h]
\begin{tabular}{ccl}
    \multicolumn{3}{c}{\includegraphics[width=0.8\textwidth]{figures/landmarks.png}} \\
    & L1 & Internal urethral ostium\\
    & L2 & External urethral ostium\\
    & L3 & Uterus top\\
    & L4 & Cervical ostium\\
    & L5 & Isthmus\\
    & L6 & Intra-uterine canal top\\
    & L7 & Right ureteral ostium\\
    & L8 & Left ureteral ostium\\
    & L9 & Internal anal sphincter\\
    & L10 & Os coccygis\\
    & L11 & Most ventral intersections of S2-S3\\
    & L12 & Most ventral intersections of S3-S4\\
    & L13 & Anterior superior border symphysis (ASBS)\\
    & L14 & Posterior inferior border symphysis (PIBS)\\
    & L15 & Right femur head\\
    & L16 & Left femur head\\
    & L17 & Left acetabulum\\
    & L18 & Right acetabulum\\
    & L19 & Left ligament rotundum\\
    & L20 & Right entrance of uterine artery to cervix\\
    & L21 & Left entrance of uterine artery to cervix\\
    & L22 & Right ligament rotundum\\
    & L23 & Most ventral intersections of S1-S2\\
    \bottomrule
\end{tabular}
    \caption{Description of landmarks. The landmarks are projected on a coronal (left) and sagittal (right) slice. L23 is not visible in this scan.}
    \label{fig:landmarks description}
\end{figure}

\newpage
\section{Effect of Selecting Reference Point}
\label{appendix: reference point}
\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{figures/genmed.png}
    \caption{Effect of the location of the reference point on the GenMED \cite{bosman2011gradients} benchmark problem. The Pareto front was approximated using 25 points. The solutions from 10 runs are shown for two different locations of the reference point.}
    \label{fig:effect of reference point}
\end{figure}
The calculation of the HV (and consequently its gradients) is sensitive to the choice of the reference point \cite{ishibuchi2018specify}, which, in turn, affects the spread of the solutions on the front. This is particularly the case for three or more objectives. In Figure \ref{fig:effect of reference point}, this phenomenon is illustrated with experiments on the convex GenMED problem with three objectives \cite{bosman2011gradients}. Briefly, in the GenMED problem, the $n$ objectives (in our case, $n = 3$, i.e., $f_1$, $f_2$, $f_3$) are sums of squared distances from $n$ unit vectors. When the reference point is far away, the final solutions tend to cluster on the edges of the Pareto front. The spread of the points becomes more uniform across the Pareto front when the reference point is moved closer. Based on these empirical observations, we tuned the reference point for MO DIR training. We considered the following choices, based on the worst loss values observed after training: (10, 10, 10), (1, 1, 1), (1, 1, 0.2), and (0.5, 1, 1). For the experiments in the paper, we selected (1, 1, 1) as the reference point because, based on visual inspection on the validation set, it provided well-distributed points across the front.
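
To make the dependence on the reference point explicit, the following sketch writes out the HV for the simpler case of two objectives. The symbols $S$, $r$, and $f^{(i)}$ are notation introduced here for illustration only. Let $S = \{f^{(1)}, \dots, f^{(p)}\}$ be a set of mutually non-dominated loss vectors, sorted by increasing first objective (and hence decreasing second objective), all of which dominate the reference point $r = (r_1, r_2)$. Then
\[
\mathrm{HV}(S; r) = \sum_{i=1}^{p} \left(r_1 - f_1^{(i)}\right)\left(f_2^{(i-1)} - f_2^{(i)}\right), \qquad f_2^{(0)} := r_2 .
\]
Differentiating this expression shows that the gradients with respect to the interior solutions do not involve $r$, whereas for the two extreme solutions
\[
\frac{\partial\, \mathrm{HV}}{\partial f_1^{(1)}} = -\left(r_2 - f_2^{(1)}\right), \qquad
\frac{\partial\, \mathrm{HV}}{\partial f_2^{(p)}} = -\left(r_1 - f_1^{(p)}\right),
\]
which grow in magnitude as the reference point moves away from the front. In this two-objective sketch, the reference point therefore acts only through the extreme solutions. This is consistent with, but does not by itself prove, the clustering towards the edges of the front observed for three objectives.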

\if false
\section{Comparison between MO DIR Without and With Additional Guidance}
To gain insights into the effect of additional guidance from organ masks on the DIR performance, we compared the following two settings: a) MO DIR using $L_{ImageSimilarity}$ and $L_{DVFSmoothness}$ (no additional guidance), b) MO DIR using $L_{ImageSimilarity}$, $L_{DVFSmoothness}$, and $L_{SegSimilarity}$ (additional guidance). In Figure \ref{fig:effect of additional guidance}, the obtained approximation sets on test scan pairs from both settings are shown in the objective space of $L_{ImageSimilarity}$, $L_{DVFSmoothness}$, and $L_{SegSimilarity}$. The figure shows that, when MO DIR is trained with the additional guidance from organ masks, some solutions are obtained in the region corresponding to lower $L_{SegSimilarity}$ loss but higher $L_{ImageSimilarity}$ loss. These solutions underline the conflict between $L_{ImageSimilarity}$ and $L_{SegSimilarity}$, whose nature and causes can only be understood by exploring the DIR outputs corresponding to these solutions. It is worth noting that with MO DIR, such an exploratory analysis is possible and straightforward.

Furthermore, in Table \ref{tab:additional guidance}, the maximum mean Dice score and the \% folding in the associated DVF of an approximation set are reported for each test scan pair. Similar to Figure \ref{fig:effect of additional guidance}, Table \ref{tab:additional guidance} also shows that by training DIR with additional guidance from organ masks, higher similarity between organ masks (indicated by high Dice scores) can be achieved without compromising the \% folding in the DVFs.
It is important to state here that the best solutions in the approximation sets according to Dice score (reported in Table \ref{tab:additional guidance}) are not the same as the best solutions according to TRE values (reported in Table \ref{tab:descriptives}), highlighting the nuances of evaluating a DIR outcome. Further, it is difficult to make clinically relevant performance comparisons based solely on quantitative values for two reasons: a) the mean Dice score is biased towards large organs, b) the solution corresponding to the maximum Dice score may be overfitted to the $L_{SegSimilarity}$ loss.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{figures/additional_guidance_lowres.png}
    \caption{Effect of additional guidance. filled circles: MO DIR without additional guidance from organ contours, triangles: MO DIR with additional guidance from organ contours. $p=27$. Approximation sets obtained from 5 models of 5-fold cross-validation are shown.}
    \label{fig:effect of additional guidance}
\end{figure}


\begin{table}[]
\centering
\caption{Maximum mean percent Dice score of four organs at risk (bowel bag, bladder, rectum, and sigmoid), and associated \% folding for approximation sets obtained from MO DIR without and with guidance from organ masks, for each test scan pair. Mean $\pm$ standard deviation from 5 models from 5-fold cross-validation is reported.}
\begin{tabular}{@{}l|cc|cc@{}}
\toprule
\multirow{2}{*}{Test scan}  & \multicolumn{2}{c}{\textbf{No Guidance}} & \multicolumn{2}{c}{\textbf{Guidance}} \\ \cmidrule(l){2-5} 
     & \% Dice                  & \% folding                       & \% Dice                   & \% folding                       \\ \midrule
1 & 97.63 $\pm$ 0.04 & 1.24 $\pm$ 0.26  & 99.28 $\pm$ 0.06 & 0.77 $\pm$ 0.22\\
2 & 92.75 $\pm$ 0.09 & 1.03 $\pm$ 0.28  & 95.66 $\pm$ 0.26 & 1.38 $\pm$ 0.28\\
3 & 96.25 $\pm$ 0.07 & 0.64 $\pm$ 0.37  & 98.99 $\pm$ 0.10 & 0.93 $\pm$ 0.17\\
4 & 96.56 $\pm$ 0.04 & 0.69 $\pm$ 0.18  & 98.53 $\pm$ 0.13 & 0.87 $\pm$ 0.28\\
5 & 94.58 $\pm$ 0.05 & 0.03 $\pm$ 0.07  & 98.24 $\pm$ 0.09 & 0.66 $\pm$ 0.07\\
6 & 96.49 $\pm$ 0.13 & 1.36 $\pm$ 0.27  & 98.73 $\pm$ 0.13 & 1.00 $\pm$ 0.37\\
7 & 96.93 $\pm$ 0.02 & 0.93 $\pm$ 0.53  & 99.01 $\pm$ 0.11 & 0.96 $\pm$ 0.49\\
8 & 97.56 $\pm$ 0.09 & 0.63 $\pm$ 0.16  & 99.07 $\pm$ 0.07 & 0.64 $\pm$ 0.11\\
9 & 95.63 $\pm$ 0.03 & 1.89 $\pm$ 0.44  & 98.01 $\pm$ 0.11 & 1.09 $\pm$ 0.42\\
10 & 95.04 $\pm$ 0.01 & 0.67 $\pm$ 0.17  & 97.48 $\pm$ 0.12 & 1.02 $\pm$ 0.26\\ \bottomrule
\end{tabular}
\label{tab:additional guidance}
\end{table}
\fi

\section{Quantitative Comparison of DIR Performance}
Although TRE is a sparse metric and is affected by inter- and intra-observer variation in the placement of landmarks, it is often used to quantitatively assess the performance of a DIR method. In this section, we compare linear scalarization and the proposed MO DIR approach described in Section \ref{experiment: ls vs hv} in terms of the mean TRE of 23 landmarks. First, we automatically select a single DIR solution from each approximation set. For this, we assume that a clinical expert would a posteriori select the DIR solution corresponding to the minimum mean TRE of the 23 landmarks. The underlying idea is that, even if the TRE is not explicitly computed, the expert intuitively looks for solutions in which familiar landmarks are well aligned. In Table \ref{tab:descriptives}, we report the mean and standard deviation of this TRE value over 5 models, each trained on a different training data split, to provide an estimate of model variance. We also report the associated folding in the DVF of the selected DIR solution. Although it is difficult to derive any clinical conclusions without inspecting the underlying DVFs, it can be observed that both linear scalarization and HV-based MO DIR find quantitatively similar trade-offs between the best TRE values and the associated DVF folding. This is not entirely surprising, given that the underlying DL architecture for DIR is the same for both methods.
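
For concreteness, the selection rule described above can be written as follows. The notation is introduced here for illustration only and rests on assumptions: $x_i$ and $y_i$ denote the positions of landmark $i$ in the two scans of a pair, $T_k$ denotes the transformation predicted by the $k$-th solution in the approximation set, all 23 landmarks are assumed to be annotated in both scans, and the direction in which the landmarks are mapped is assumed.
\[
\overline{\mathrm{TRE}}_k = \frac{1}{23} \sum_{i=1}^{23} \left\| T_k(x_i) - y_i \right\|_2 ,
\qquad
k^{\ast} = \arg\min_{k \in \{1, \dots, p\}} \overline{\mathrm{TRE}}_k .
\]
Under this notation, the TRE columns of Table \ref{tab:descriptives} correspond to the mean and standard deviation of $\overline{\mathrm{TRE}}_{k^{\ast}}$ over the 5 models, and the folding columns to the \% folding of the DVF belonging to solution $k^{\ast}$.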

One might notice a trend of higher TRE values and lower DVF folding in the selected solutions from HV maximization based MO DIR. However, the training approach may play a role in this, because training proceeds differently for MO DIR and linear scalarization. Training neural networks with HV maximization is more complex than training with fixed weights, as is done in linear scalarization, because the gradients for each network head change dynamically as a consequence of the HV maximization goal. Therefore, if the exact weights corresponding to the desired trade-off between the objectives are known a priori, linear scalarization may yield non-dominated solutions faster. For a fair comparison, we trained the networks in both the linear scalarization and the MO DIR approach for the same number of iterations. It is possible that training had not yet saturated for one or both procedures at this point. Ideally, upon saturation, we would expect both linear scalarization and HV maximization to obtain solutions with the same proximity to the Pareto front. However, obtaining the same diversity of solutions (for a given $p$) along the front is not guaranteed for linear scalarization. As demonstrated in Section \ref{experiment: ls vs hv}, this is because the translation from scalarization weights to a well-distributed set of solutions along the approximation front is not trivial. Therefore, achieving a diverse spread of solutions through linear scalarization would require trying many more weight combinations. With the HV maximization based MO DIR approach, on the other hand, it can be achieved in a single training run.
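
The contrast between fixed and dynamically changing weights can be sketched as follows. The notation is assumed here for illustration: $L_j(\theta_k)$ is the $j$-th of the $n$ losses evaluated for network head $k$ with parameters $\theta_k$, $w_j$ are the scalarization weights, and $\mathrm{HV}$ is the hypervolume of the $p$ loss vectors with respect to the reference point. For simplicity, the heads are treated as having separate parameters, and any normalization of the gradients used in the actual training is omitted. Linear scalarization follows the gradient
\[
\nabla_{\theta} \sum_{j=1}^{n} w_j L_j(\theta) = \sum_{j=1}^{n} w_j \nabla_{\theta} L_j(\theta),
\]
in which the weights $w_j$ are fixed throughout training. For HV maximization, the chain rule gives
\[
\nabla_{\theta_k} \mathrm{HV} = \sum_{j=1}^{n} \frac{\partial\, \mathrm{HV}}{\partial L_j(\theta_k)} \nabla_{\theta_k} L_j(\theta_k).
\]
The partial derivatives $\partial\, \mathrm{HV} / \partial L_j(\theta_k)$ play the role of the weights, but they depend on the current positions of all $p$ solutions in the objective space and therefore change from one iteration to the next.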

\begin{table}[]
\centering
\caption{Mean TRE and associated \% folding in the DVF of the `best' solution in the approximation set obtained by linear scalarization and MO DIR, respectively, for each test scan pair. In each approximation set, the solution corresponding to the minimum mean TRE of 23 landmarks is assumed to be the `best' for the sake of quantitative comparison. Mean $\pm$ standard deviation over 5 models trained on different training data splits is reported without model selection.}
\begin{tabular}{@{}lc|cc|cc@{}}
\toprule
\multirow{2}{*}{Test scan}    & \multirow{2}{*}{TRE before} & \multicolumn{2}{c}{\textbf{Linear Scalarization}} & \multicolumn{2}{c}{\textbf{MO DIR}} \\ \cmidrule(l){3-6} 
     && TRE                  & \% folding                       & TRE                   & \% folding                       \\ \midrule
1 & 3.97 & 3.63 $\pm$ 0.04 & 0.29 $\pm$ 0.19  & 3.74 $\pm$ 0.03 & 0.05 $\pm$ 0.03\\
2 & 4.71 & 4.53 $\pm$ 0.11 & 3.45 $\pm$ 0.38  & 4.66 $\pm$ 0.07 & 2.00 $\pm$ 1.23\\
3 & 8.21 & 8.04 $\pm$ 0.10 & 1.33 $\pm$ 1.18  & 8.12 $\pm$ 0.06 & 1.07 $\pm$ 1.64\\
4 & 9.07 & 8.18 $\pm$ 0.07 & 0.12 $\pm$ 0.15  & 8.58 $\pm$ 0.17 & 0.47 $\pm$ 0.39\\
5 & 4.46 & 4.01 $\pm$ 0.06 & 0.80 $\pm$ 0.96  & 4.08 $\pm$ 0.07 & 1.36 $\pm$ 1.01\\
6 & 5.55 & 4.52 $\pm$ 0.09 & 1.31 $\pm$ 0.17  & 4.69 $\pm$ 0.09 & 0.76 $\pm$ 0.32\\
7 & 5.99 & 5.90 $\pm$ 0.03 & 0.26 $\pm$ 0.18  & 5.93 $\pm$ 0.02 & 0.29 $\pm$ 0.13\\
8 & 4.39 & 3.96 $\pm$ 0.05 & 2.72 $\pm$ 0.88  & 4.06 $\pm$ 0.05 & 1.72 $\pm$ 1.31\\
9 & 5.73 & 5.06 $\pm$ 0.06 & 0.87 $\pm$ 0.24  & 5.24 $\pm$ 0.13 & 0.82 $\pm$ 0.97\\
10 & 3.80 & 3.72 $\pm$ 0.03 & 0.20 $\pm$ 0.28  & 3.70 $\pm$ 0.03 & 0.11 $\pm$ 0.13\\ \midrule
\multirow{3}{6em}{Mean $\pm$ SD across patients} &&&&&\\
& 5.59 $\pm$ 1.71  & 5.15 $\pm$ 1.63 & 1.14 $\pm$ 1.21  & 5.28 $\pm$ 1.69 & 0.87 $\pm$ 1.04\\
&&&&&\\ \bottomrule
\end{tabular}
\label{tab:descriptives}
\end{table}
