\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{graphicx}
\usepackage{amsmath, amssymb, amsfonts, xcolor}
\usepackage{colortbl}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{diagbox}
\usepackage{caption}
\usepackage{soul}
\definecolor{myhighlight}{rgb}{0.99,0.99,0.0}
\definecolor{mytext}{rgb}{0.6,0.0,0.99}
\newcommand{\mathcolorbox}[2]{\colorbox{#1}{$\displaystyle #2$}}
\sethlcolor{myhighlight}
\usepackage{mwe} % to get dummy images
\jmlrvolume{-- 227}
\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024 submission}
\editors{Accepted for publication at MIDL 2024}

\title[SiamRegQC]{Registration Quality Evaluation Metric with Self-Supervised Siamese Networks}

\midlauthor{\Name{Tanvi Kulkarni\nametag{$^{1,2}$}} \Email{ee20s046@smail.iitm.ac.in}\\
\addr $^{1}$ Department of Electrical Engineering, Indian Institute of Technology Madras (IITM), India \\
\addr $^{2}$ Healthcare Technology Innovation Centre, IITM, India \AND
\Name{Sriprabha Ramanarayanan\nametag{$^{1,2}$}} \Email{sriprabha.r@htic.iitm.ac.in}\\
\Name{Keerthi Ram\nametag{$^{2}$}} \Email{keerthi@htic.iitm.ac.in}\\
\Name{Mohanasankar Sivaprakasam\nametag{$^{1,2}$}} \Email{mohan@ee.iitm.ac.in}\\ 
}

\begin{document}

\maketitle

\begin{abstract}
Registration is one of the earliest steps in many medical imaging pipelines, and its quality determines the quality of the downstream task. Traditionally, registration quality is evaluated with pixel-wise metrics such as Mean Squared Error (MSE) and Structural Similarity Index (SSIM). These pixel-wise measures can be susceptible to local minima, yielding sub-optimal and inconsistent quality evaluation. Moreover, it may be essential to incorporate semantic features that are crucial for human visual perception of registration quality. To this end, we propose a data-driven approach that learns the semantic similarity between the registered and target images to provide a perceptual and consistent evaluation of registration quality. In this work, we train a Siamese network to classify registered and synthetically misaligned pairs of images. We leverage the latent Siamese encodings to formulate a semantic registration evaluation metric, SiamRegQC. We analyze SiamRegQC from different perspectives: robustness to local minima (smoothness of the evaluation metric), sensitivity to small misalignment errors, consistency with visual inspection, and statistically significant evaluation of registration algorithms with a p-value $<$ 0.05. We demonstrate the effectiveness of SiamRegQC on two downstream tasks: (i) rigid registration of 2D histological serial sections, where evaluating sub-pixel misalignment errors is critical for accurate 3D volume reconstruction. Here, SiamRegQC provides a more realistic quality evaluation that is sensitive to smaller errors and consistent with visual inspection, as illustrated by semantic feature maps that are more perceptually meaningful than pixel-wise MSE maps. (ii) Unsupervised multimodal non-rigid registration, where the registration framework trained with SiamRegQC as a loss function achieves the highest average SSIM value, 0.825, compared with previously proposed deep similarity metrics.
\end{abstract}

\begin{keywords}
Image registration, Evaluation metric, Cosine similarity, Siamese network, Semantic representation.
\end{keywords}

\section{Introduction}
\label{sec:introduction}
Registration is the task of aligning a source image to match the physical coordinates of a target image. In medical image analysis, registration is used for many downstream tasks, such as atlas-based segmentation \cite{balakrishnan2019voxelmorph}, \cite{Kulkarni2023LearningTA}, \cite{Aqil2023ConfoundingFM} and reconstruction of 3D volumes of organs/tissues by successively registering their 2D histological serial section images \cite{8575935}. The registration quality directly impacts the effectiveness of the downstream tasks, where even small errors are significant. For instance, in 3D histological reconstruction \cite{Lobachev2021EvaluatingRO}, the misalignment error of every serial section registration accumulates from the middle section to the ends of the volume, resulting in an irregularly reconstructed 3D volume. Leveraging more accurate quality metrics for registration optimization can improve its convergence~\cite{Simonovsky2016ADM},~\cite{Czolbe2023SemanticSM}. Traditional metrics like Mean Squared Error (MSE), Structural Similarity Index (SSIM) \cite{Wang2004ImageQA}, Jaccard overlap measures \cite{Kartasalo2018ComparativeAO}, \cite{7482649} and MIND~\cite{Heinrich2012MINDMI} are commonly used to evaluate the registration quality. However, traditional metrics rely on basic pixel-wise calculations and can be susceptible to local minima \cite{Zhang2018TheUE}. Including semantic features that capture the nuances essential for human visual perception could ensure a consistent and perceptually accurate assessment of the registration quality \cite{Wang2009MeanSE}.

With the advent of Machine Learning (ML) for medical applications, several supervised ML algorithms have been proposed that automatically classify affine-registered and misaligned pairs of images \cite{SOKOOTI2019110}, \cite{TUMMALA2021104997}. However, these algorithms are supervised using traditional metrics and do not offer quality assessment beyond the binary classification of misaligned pairs.
% Several Deep Learning (DL) methods use a Bayesian approach to leverage the registration uncertainty as a measure of the registration error \cite{5223699}, \cite{Risholm2013BayesianCO}. However, Bayesian approaches might not be independent of the registration method and might require reiterating the registration step during the error evaluation.
Recently, deep learning (DL) methods have been proposed that use semantic representations and intermediate network layers as perceptual metrics for image quality assessment \cite{GAO2017104}, \cite{Zhang2018TheUE}. Such perceptual metrics have been shown to provide a more reliable quality assessment than signal-to-noise ratio and SSIM for imaging applications like object detection and denoising. Previously, two-channel Convolutional Neural Networks (CNNs) have been proposed as deep similarity metrics for multimodal registration \mbox{\cite{Cheng2018DeepSL}}, \mbox{\cite{Simonovsky2016ADM}}. More recently, DeepSim \mbox{\cite{Czolbe2023SemanticSM}} introduced semantic features derived from unsupervised autoencoders as similarity measures to optimize DL-based registration methods. Unlike the DeepSim method, we use Siamese networks to obtain the semantic representations of the input images efficiently (see Appendix~\ref{appendix:lit-survey}).

Siamese networks have been applied to image registration \cite{Chen2021DetarNetDT}, \cite{Neumann2020DeepSL}, \cite{Tang2022ADN} because they efficiently process a pair of input samples (multiple inputs, in general) using the same network parameters. Because of their dual-encoder architecture with identical parameters, Siamese networks can transform an input pair of registered and target images into the same latent feature space \cite{Bromley1993SignatureVU}. We choose the cosine similarity function to formulate our proposed registration evaluation metric because it evaluates similarity irrespective of the dimensionality of the features and has a restricted range of values, always lying between $-1$ and $+1$, as shown in \cite{Nguyen2010CosineSM}. Additionally, to learn distinctive representations, we utilize the cosine similarity function as a contrastive loss~\cite{Chen2020ASF} that encourages similarity between registered image pairs and discourages it between misaligned image pairs (see Appendix~\ref{appendix:rationale}). To summarize our contributions, we propose SiamRegQC, a data-driven deep learning-based quality evaluation metric for image registration. The proposed metric is agnostic to the registration method and uses semantic representations of the registered and target images learned by a Siamese network. We assess the efficacy of the proposed evaluation metric, SiamRegQC, in the following aspects:

\begin{enumerate}
    \item Robustness to local minima~-~SiamRegQC shows smoother surfaces than MSE and SSIM, as seen from the metric surface plot analysis over the rigid misalignment space. \\[-20pt]
    \item Sensitivity to smaller misalignment errors~-~From the local variance measures of the surface plot analysis, SiamRegQC exhibits the best sensitivity to unit changes in the misalignment space with a value of 1.9e-3 while MSE was found to be insensitive with a sensitivity value as low as 3.5e-5. \\[-20pt]
    \item  Consistency with visual inspection~-~The latent Siamese encodings offer the advantage of a better perceptual understanding of the registration quality than pixel-wise MSE maps. \\[-20pt]
    \item Application to the downstream section-wise 2D rigid registration task for Nissl-stained mouse brain volume reconstruction. Here, we use SiamRegQC as a registration quality evaluation metric for benchmarking the performance of three different registration algorithms, where SiamRegQC critically evaluates the algorithms due to its enhanced sensitivity.  \\[-20pt]
    \item Application as a similarity metric~-~We leverage SiamRegQC to drive a VoxelMorph-based unsupervised non-rigid registration framework for unimodal and multimodal data. A maximum average SSIM value of 0.967 (0.825 for multimodal) was observed in the registration outputs compared to previously proposed methods.
\end{enumerate}

\section{Methodology}
\label{sec:methodology}
This section introduces the overall network architecture for formulating the proposed registration evaluation metric, SiamRegQC, followed by a description of the dataset and implementation details for evaluating SiamRegQC.

\subsection{Network Architecture}
\label{sec:network_architecture}
The network architecture for SiamRegQC consists of a Siamese network, as shown in Figure~\ref{fig:process_map}. The Siamese network is divided into two functional parts: an encoder, $\phi$, for deep feature extraction from the input images, and a fully connected network, $Clf$, for the classification task. The encoder, $\phi$, consists of 4 convolutional blocks with 128, 64, 32, and 16 channels, respectively. Each block has a convolutional layer with kernel size 3, stride 1, and ReLU nonlinear activation, followed by a MaxPool layer of stride 2. The classifier, $Clf$, consists of two hidden layers with 1024 and 256 neurons, respectively, and ReLU activation (see Appendix~\ref{appendix:rationale} and Figure~\ref{fig:arc-differences} for more details).
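
As a rough illustration of the encoder's output size, assume, purely for the sake of the example, that every convolution preserves the spatial size (``same'' padding) and that every MaxPool layer halves it; neither detail is specified above. For a $256\times256$ input section, the spatial resolution after the four blocks would then be
\begin{equation*}
256 \rightarrow 128 \rightarrow 64 \rightarrow 32 \rightarrow 16,
\end{equation*}
so that $\phi$ would produce a $16\times16$ map with 16 channels, i.e., $16\cdot16\cdot16 = 4096$ features per image. Other padding or pooling choices would change these numbers.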

\begin{figure}
\centering
\includegraphics[width=1.\linewidth]{Plots/Graphical Abstract_updated_small (1).pdf}  
\caption{Graphical representation of the semantic registration evaluation metric, SiamRegQC, and its formulation. The accompanying table lists the desirable aspects of SiamRegQC.}
\label{fig:process_map}
\end{figure}

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.8\linewidth]{Plots/Architecture Differences (1).pdf}
    \caption{SiamRegQC allows for learning deep semantic features with encoders pre-trained from an autoencoding task to minimize the MSE loss between input and decoded images. In a second training step, SiamRegQC is supervised with a binary misalignment classification task to classify original and synthetically misaligned images.}
    \label{fig:arc-differences}
\end{figure}

\subsection{Loss functions and Training}
SiamRegQC is trained with a two-step procedure, as shown in Figure~\ref{fig:arc-differences}. (i) The encoder, $\phi$, of SiamRegQC is trained on an autoencoding task, in which the MSE loss between a single input image and its decoded output is minimized.
(ii) The entire network (encoder, $\phi$, and classifier, $Clf$) is trained end-to-end on a classification task based on the categorical labels (aligned: class 0; misaligned: class 1) assigned to an input pair of target and moving images. We use a cross-entropy loss, $L_{ce\_loss}$, and a modified cosine similarity-based contrastive loss~\cite{Chen2020ASF}, $L_{con\_loss}$, to drive the training. Hence, the combined loss, $L_{total}$, for training our Siamese network is formulated as:
\small
\begin{equation}
\label{loss functions}
{L_{total} = L_{ce\_loss}(pred, label) +
L_{con\_loss}(\phi(ref), \phi(img), label)}
\end{equation}
where $L_{con\_loss}$ is formulated as:
\begin{equation}
\label{eq:Con-Loss}
\begin{split}
L_{con\_loss}(\phi(ref), \phi(img), label) ={} & label\cdot\bigl(1 - Cos\_Sim(\phi(ref), \phi(img))\bigr)^2 \\
& + (1-label)\cdot Cos\_Sim(\phi(ref), \phi(img))^2
\end{split}
\end{equation}
\normalsize
where $pred$ is the predicted class and $label$ is the true class for the input pair of images $img$ and $ref$.
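
To see how the two branches of Equation~\ref{eq:Con-Loss} behave, it can help to write $c = Cos\_Sim(\phi(ref), \phi(img)) \in [-1, 1]$ and read the equation directly; the shorthand $c$ is introduced here only for illustration:
\begin{equation*}
L_{con\_loss} =
\begin{cases}
(1 - c)^2, & label = 1, \\
c^2, & label = 0.
\end{cases}
\end{equation*}
The $label = 1$ branch vanishes only at $c = 1$ and reaches its maximum of 4 at $c = -1$, whereas the $label = 0$ branch vanishes at $c = 0$ and reaches its maximum of 1 at $c = \pm 1$. For example, $c = 0.9$ gives a loss of $0.01$ in the first branch and $0.81$ in the second. This is only a direct reading of the equation; which branch a given pair activates depends on the label encoding used during training.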

\subsection{Registration Quality Evaluation Metrics}
\label{sec:equations}
We leverage the cosine similarity measure between the Siamese encodings of registered and target images to formulate SiamRegQC. The quality of the registration between a registered image, $img$, and a target image, $ref$, is formulated using the following equations:
\small
\begin{align}
\label{Cos}
Cos\_Sim(\phi(ref), \phi(img)) &= \frac{\phi(ref)\cdot \phi(img)}{\|\phi(ref)\|_{2}\cdot\|\phi(img)\|_{2}} \\
\label{SiamRegQC_eq}
\mathrm{SiamRegQC}(ref, img) &= 1 - Cos\_Sim(\phi(ref), \phi(img))
\end{align}
\normalsize
A SiamRegQC value close to 0 indicates that the input pair of images is well registered, while a value close to 2 (the cosine similarity function attains a minimum value of $-1$) indicates that the images are misaligned.
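
To make the scale concrete, Equation~\ref{SiamRegQC_eq} can be inverted as $Cos\_Sim = 1 - \mathrm{SiamRegQC}$. Encodings pointing in the same direction give a value of 0, orthogonal encodings give 1, and opposite encodings give 2. Applying this to the two SiamRegQC values reported for the Nissl example in Figure~\ref{fig:FAM}(a) gives
\begin{equation*}
0.052 \;\Rightarrow\; Cos\_Sim = 0.948, \qquad 0.132 \;\Rightarrow\; Cos\_Sim = 0.868,
\end{equation*}
which corresponds to angles of approximately $18.6^\circ$ and $29.8^\circ$ between the two encodings. The angles are derived here from the reported values for illustration only and are not separate measurements.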

\subsection{Dataset and Implementation Details}
\label{dataset}
To evaluate registration quality in the context of a downstream task, we use a 0.05~mm downsampled version of the high-resolution Nissl-stained histological adult mouse brain data with 200 coronal sections, distributed by the Allen Brain Institute \cite{doi:10.1086/596246}. To validate our method on a larger dataset, we also use 3000 coronal sections of adult human brain MRI volumes from the IXI dataset\footnote{\url{https://brain-development.org/ixi-dataset/}}, each with a pixel resolution of 1~mm. All volumes are skull-stripped, bias-corrected, and intensity-normalized as described by \cite{chen2021transmorph}. All volumes are zero-padded and center-cropped so that each 2D section has a size of $256\times256$.

We use random rotations and translations to generate synthetic rigidly misaligned images, while pixel-wise deformed images are generated using random smooth Gaussian flow fields (see Figure~\ref{fig:def-flowchart}). All network training experiments are run on an Nvidia GeForce GTX 1660 Ti GPU with 14GB RAM for 5 epochs, using 5-fold cross-validation and the Adam optimizer \cite{kingma2014adam} with an initial learning rate of 0.001.
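
As a sketch of what such synthetic misalignments can look like, a rigid misalignment of a pixel coordinate $x \in \mathbb{R}^2$ may be written in the standard form
\begin{equation*}
x' = R(\theta)\,x + t, \qquad
R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},
\end{equation*}
with a randomly drawn rotation angle $\theta$ and translation $t$. A smooth non-rigid deformation may similarly be written as $x' = x + u(x)$ with $u = G_{\sigma} * n$, i.e., a random field $n$ smoothed by a Gaussian kernel $G_{\sigma}$. The symbols $\theta$, $t$, $n$, and $\sigma$, as well as this particular parametrization, are illustrative assumptions; the exact generation procedure is the one summarized in Figure~\ref{fig:def-flowchart}.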
% Since the transformations are in physical dimensions, our transformations are independent of the image size. Hence, we can perform registration quality evaluation on downsampled images for computational efficiency.
\section{Experiments and Results}
\label{sec:experiments}
In this section, we demonstrate the effectiveness of SiamRegQC in two steps: (i) we explore the desirable aspects of SiamRegQC~-~robustness to local minima, sensitivity to misalignment errors, and consistency with visual inspection, as mentioned earlier; (ii) we apply SiamRegQC as a similarity measure to drive unsupervised non-rigid registration on multimodal data.

\begin{figure}[t!]
    \floatconts
    {fig:figure_loss}
    { \caption{Surface plot analysis of registration quality evaluation metrics over the rigid misalignment space. While MSE and SSIM distinctly show the presence of local minima, SiamRegQC exhibits a noticeably smoother surface. $\delta$ refers to the mean sensitivity of the metric to a change of 0.1~mm in translation error and $1^\circ$ in rotation error. SiamRegQC shows the maximum sensitivity to misalignment errors, $\delta$.}}
    {
    \subfigure[MSE ($\delta$ = 3.1e-5)]{ \includegraphics[width=0.25\linewidth]{Plots/MSE_Loss.eps}
    \captionsetup{aboveskip=30pt}
    \label{fig:l_mse}
    }
    \subfigure[SSIM ($\delta$ = 0.0015)]{\includegraphics[width=0.25\linewidth]{Plots/SSIM_Loss.eps}}
    \label{fig:l_ssim}
    \subfigure[SiamRegQC$_{np\_nl}$ ($\delta$~=~0.0005)]{\includegraphics[width=0.25\linewidth]{Plots/SiamRegQC_Npt_Loss.eps}} 
    \label{fig:l1} 
    \subfigure[SiamRegQC$_{np}$ ($\delta$~=~0.0008)]{\includegraphics[width=0.25\linewidth]{Plots/SiamRegQC_Nptcos_Loss.eps}} 
    \label{fig:l2} 
    \subfigure[SiamRegQC$_{nl}$ \newline ($\delta$ = 0.0012)]{\includegraphics[width=0.25\linewidth]{Plots/SiamRegQC_Loss.eps}} 
    \label{fig:l3} 
    \subfigure[SiamRegQC \newline ($\delta$ = 0.0019)]{\includegraphics[width=0.25\linewidth]{Plots/SiamRegQC_cos_Loss.eps}}
    }
\end{figure}
 \begin{figure}[t!]
    \centering
    \subfigure[NISSL Dataset; Translation Error: 0.25 mm, 0.5 mm; MSE: 0.002, 0.003; SSIM: 0.659, 0.631; SiamRegQC: 0.052, 0.132.]{\includegraphics[width=1.\linewidth]{Plots/sense_trial_title_new.eps}} \\[1pt]
    \subfigure[MRI Dataset; Rotation Error: $4^\circ$, $14^\circ$; MSE: 0.002, 0.005; SSIM: 0.843, 0.774; SiamRegQC: 0.011, 0.090.]{\includegraphics[width=1.\linewidth]{Plots/sense_trial_MRI_new.eps}}
    \caption{Examples showing quality evaluation maps for overlapping registered and target images. Columns 1, 2: Green is the target image, $ref$; red is the registered image to be evaluated, $img$; and yellow represents the overlapping regions between $ref$ and $img$. Columns 3, 4: Pixel-wise MSE map, $||ref - img||_{2}$. Columns 5, 6 and Columns 7, 8: Channels 2 and 11 of the Siamese network encoded feature activation maps, $||\phi(ref) - \phi(img)||_{2}$, which delineate the misalignment errors at the image boundaries.}
    \label{fig:FAM}
\end{figure}
\subsection{Desirable Aspects of SiamRegQC for critical registration evaluation}
\label{sec:desirable-aspects}
In this section, we discuss the desirable aspects of SiamRegQC that lead to improved registration optimization when used as a deep-similarity metric for VoxelMorph registration, as discussed in Section~\ref{sec:multimodal}. \\[1pt]
\textbf{Robustness to local minima}- We study the variation of metric values for a maximum translation error of 0.4~mm and a maximum rotation error of $4^\circ$, as shown in Figure~\ref{fig:figure_loss}. Ideally, the metric values on the surface plot should vary smoothly to avoid inconsistent evaluation. We find that MSE and SSIM are not immune to local minima, which are most visible around zero translation error and minimum rotation error. In contrast, all variants of SiamRegQC show smoothly varying metric values over the rigid misalignment space. $\mathrm{SiamRegQC}_{np}$, $\mathrm{SiamRegQC}_{nl}$ and $\mathrm{SiamRegQC}_{np\_nl}$ denote ablated versions that omit pre-training of the encoder, $\phi$, the contrastive loss, $L_{con\_loss}$, or both, respectively. Here, ``$np$'' and ``$nl$'' indicate ``no-pre-training'' and ``no-contrastive-loss'', respectively. Therefore, this analysis also serves as an ablation study showing the importance of pre-training and of adding the contrastive loss.
% \label{sec:1}
\\[1pt]
\textbf{Sensitivity of Evaluation Metric and Perceptual Visualization of Misalignment Errors}- From Figure~\ref{fig:figure_loss}, we calculate the mean local variance of the evaluation metric over a unit area of the misalignment space as a coarse measure of the sensitivity of each metric, $\delta$, to a $1^\circ$ change in rotation error and a 0.1~mm change in translation error. SiamRegQC exhibits the maximum sensitivity with $\delta$~=~0.0019, closely followed by SSIM with $\delta$~=~0.0015. Figure~\ref{fig:FAM} shows that the difference between SiamRegQC feature maps conveys the misalignment error more intuitively than MSE difference maps. Notice that, in Figure~\ref{fig:FAM}a, SiamRegQC shows the largest change in metric value for a 0.25~mm difference in translation error, further illustrating its high sensitivity. The consistency of SiamRegQC with visual inspection and its application to rigid registration algorithms are covered in Appendix~\ref{appendix-desirable}.
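
For a sense of scale, the $\delta$ values reported in Figure~\ref{fig:figure_loss} can be compared directly; the following are simple ratios of the reported numbers, not additional measurements:
\begin{equation*}
\frac{\delta_{\mathrm{SiamRegQC}}}{\delta_{\mathrm{MSE}}} = \frac{1.9\times10^{-3}}{3.1\times10^{-5}} \approx 61, \qquad
\frac{\delta_{\mathrm{SiamRegQC}}}{\delta_{\mathrm{SSIM}}} = \frac{1.9\times10^{-3}}{1.5\times10^{-3}} \approx 1.27 .
\end{equation*}
Within the ablation, $\delta$ rises from 0.0005 ($np\_nl$) to 0.0008 with the contrastive loss alone ($np$), to 0.0012 with pre-training alone ($nl$), and to 0.0019 with both, i.e., a factor of 3.8 over the fully ablated variant. Likewise, for the Nissl example in Figure~\ref{fig:FAM}a, increasing the translation error from 0.25~mm to 0.5~mm changes MSE by $0.003 - 0.002 = 0.001$, SSIM by $0.659 - 0.631 = 0.028$, and SiamRegQC by $0.132 - 0.052 = 0.080$.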
% \label{sec:2}
\subsection{SiamRegQC as a deep similarity metric for unsupervised multimodal non-rigid registration}
\label{sec:multimodal}
\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/Registration framework.pdf}
    \caption{Incorporating SiamRegQC as the loss function in an unsupervised image registration framework and comparison with other traditional and deep similarity metrics.}
    \label{fig:registration framework}
\end{figure}

In this section, we use SiamRegQC as a deep-similarity-based loss function to drive unsupervised non-rigid VoxelMorph registration. To test the effectiveness of our model on multimodal data, the framework is trained on pairs of intra-subject multi-contrast T1 and T2 MRI images from the IXI dataset, as shown in Figure~\ref{fig:registration framework}. Figure~\ref{fig:def-flowchart} in Appendix~\ref{appendix:nonrigid deformation} shows the process of generating synthetic deformable transformations for training SiamRegQC. Table~{\ref{tab:multimodal-metrics}} and Figure~{\ref{fig:scatterplot-multimodal}} show that SiamRegQC performs competitively against other deep similarity metrics~\cite{Cheng2018DeepSL},~\cite{Czolbe2023SemanticSM} and traditional multimodal metrics like Normalized Cross Correlation (NCC) and MIND \mbox{\cite{Heinrich2022VoxelmorphGB}}: among the VoxelMorph variants in Table~{\ref{tab:multimodal-metrics}}, it attains the lowest MSE, the highest SSIM, and the second-highest NCC. Figure~\ref{fig:multimodal-registration} shows corresponding qualitative examples of multimodal registration. Further exploration of other multimodal datasets and varied registration frameworks is an interesting direction for future work. A similar unimodal registration experiment on IXI T1-MRI data is detailed in Appendix~{\ref{appendix:nonrigid deformation}}, which shows that SiamRegQC performs better than other similarity metrics.
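
For readers unfamiliar with this setup, a typical unsupervised registration objective with SiamRegQC as the similarity term could be sketched as follows. The displacement field $u$, the smoothness regularizer, and the weight $\lambda$ follow the usual VoxelMorph-style formulation and are assumptions here, since the exact objective is not stated in this section:
\begin{equation*}
\mathcal{L}(ref, mov) = \mathrm{SiamRegQC}\bigl(ref,\; mov \circ (\mathrm{Id} + u)\bigr) + \lambda \sum_{p} \|\nabla u(p)\|_{2}^{2},
\end{equation*}
where $mov \circ (\mathrm{Id} + u)$ denotes the moving image warped by the predicted displacement field and the sum runs over pixel locations $p$. Under this reading, the registration network is rewarded for producing warped images whose Siamese encodings are close in direction to those of the target, while the second term discourages irregular deformations.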

\begin{table}[t!]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccccc}
\hline
Evaluation Metric &
  \begin{tabular}[c]{@{}c@{}}Deformed Image\\ (Before Registration)\end{tabular} &
  $\mathrm{VXM_{ncc}}$ &
  $\mathrm{VXM_{MIND}}$ &
  $\mathrm{VXM_{Cheng}}$ &
  $\mathrm{VXM_{DeepSim}}$ &
  $\mathrm{VXM_{SiamRegQC}}$ &
  ANTsPy \\ \hline
MSE &
  $0.022 \pm 0.021$ &
  $0.012 \pm 0.006$ &
  {\color[HTML]{3531FF} $0.011 \pm 0.005$} &
  $0.012 \pm 0.005$ &
  $0.012 \pm 0.005$ &
  {\color[HTML]{009901} $0.010 \pm 0.004$} &
  {\color[HTML]{3531FF} $0.011 \pm 0.005$} \\ \hline
NCC &
  $0.65 \pm 0.11$ &
  $0.821 \pm 0.033$ &
  $0.815 \pm 0.033$ &
  {\color[HTML]{009901} $0.837 \pm 0.005$} &
  $0.814 \pm 0.032$ &
  {\color[HTML]{3531FF}$0.825 \pm 0.027$} &
  {\color[HTML]{009901} $0.902 \pm 0.022$} \\ \hline
SSIM &
  $0.699 \pm 0.134$ &
  $0.817 \pm 0.067$ &
  $0.822 \pm 0.063$ &
  $0.834 \pm 0.059$ &
  {\color[HTML]{333333} $0.828 \pm 0.062$} &
  {\color[HTML]{009901} $0.845 \pm 0.056$} &
  {\color[HTML]{009901} $0.865 \pm 0.053$} \\ \hline
\end{tabular}%
}
\caption{Quantitative comparison of VoxelMorph (VXM) registration trained with SiamRegQC against traditional and deep similarity metrics, before and after registration, for MRI T1 to T2 multimodal data. Green highlights the best performance for each evaluation metric, and blue highlights the second-best performance.}
\label{tab:multimodal-metrics}
\end{table}

\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/scatter_plots_multimodal.pdf}
    \caption{Improvement in NCC and SSIM scores of SiamRegQC with different similarity loss functions for MRI T1~to~T2 multimodal images. SiamRegQC shows competitive improvements compared to other deep similarity-based loss functions and is closest to the reference ANTsPy performance. All data points above the dashed line suggest improvement in the registration performance.}
    \label{fig:scatterplot-multimodal}
\end{figure}

\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/Copy of deformable_MIND.pdf}
    \caption{Registered images from different VoxelMorph networks supervised with various traditional and deep-similarity-based loss functions for MRI T1~to~T2 multimodal images.}
    \label{fig:multimodal-registration}
\end{figure}
 
\section{Conclusion and Future Work}
In this work, we take a first step towards utilizing Siamese network-encoded representations for registration quality evaluation. We analyze our results from different perspectives. Our proposed data-driven, deep learning-based evaluation metric, SiamRegQC, is less affected by local minima and offers well-delineated registration quality visualization maps closer to human perception than pixel-wise MSE maps. SiamRegQC shows increased sensitivity to even smaller misalignment errors while maintaining consistency of values for visibly well-registered images. SiamRegQC allows for evaluating and benchmarking registration methods with statistical significance. Finally, SiamRegQC exhibits superior unsupervised deformable registration performance compared to previously proposed deep similarity metrics for unimodal and multimodal data. From a broader perspective, our paper opens up interesting directions to formulate evaluation strategies using data-driven representation learning beyond medical image registration.
% % Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}
\bibliography{midl23_227}

\appendix

\section{Previous Related Works}
\label{appendix:lit-survey}
In the recent past, learning a similarity metric with supervised CNNs for optimizing a registration algorithm has been studied in many works. A CNN architecture with two-channel input has been proposed for the classification of aligned and misaligned multimodal images~\mbox{\cite{Cheng2018DeepSL}}. They leveraged the predicted class as a probabilistic similarity score. Further improvement of the novel similarity metric has been explored by performing the misalignment classification on smaller patches of the image for better localization~\mbox{\cite{Simonovsky2016ADM}}. Furthermore, the authors of~\mbox{\cite{Simonovsky2016ADM}} demonstrated the advantages of their proposed similarity metric by actually using it to drive a continuous optimization registration framework. DeepSim, another data-driven similarity metric, has recently been proposed for registration~\mbox{\cite{Czolbe2023SemanticSM}}. DeepSim uses autoencoders trained on an unsupervised autoencoding task guided with an MSE Loss for deep feature extraction of the target and moving images. While the earlier deep similarity metrics proposed by~\mbox{\cite{Cheng2018DeepSL}} and~\mbox{\cite{Simonovsky2016ADM}} benefit from training their two-channel CNNs in the context of registration misalignment with a binary classification task, DeepSim~\mbox{\cite{Czolbe2023SemanticSM}} might have the advantage of learning more effective semantic features from a more complex autoencoding task. In this work, we aim to combine these advantages and propose an enhanced deep similarity metric using Siamese network encoders for learning complex semantic features from a misalignment classification task. Refer to Figure~{\ref{fig:arc-differences}} for a graphical illustration of the architecture differences of SiamRegQC from the previously proposed methods. Another key difference between SiamRegQC and the previous metrics is that SiamRegQC is trained in two steps. 
(i)  The encoder, $\phi$ of SiamRegQC, is trained with an unsupervised autoencoding task on a single input MRI image. 
(ii)  The entire network (pre-trained encoder + classifier, $Clf$) is trained end-to-end on a classification task based on the categorical labels (aligned or misaligned) assigned to a pair of target and moving images as the input. 

\section{Rationale for choosing the architecture for SiamRegQC}
\label{appendix:rationale}
In this section, we explain the rationale for using a Siamese network architecture as the backbone for semantic feature extraction of the target and moving images. We further discuss the simplicity of using a classification task instead of a regression task at the final stage of the architecture, as shown in Figure~{\ref{fig:arc-differences}}.

\subsection{Rationale for using Siamese Networks}
The dual-encoder architecture of SiamRegQC provides the ability to represent both the target and moving images in the same latent space. This property gives us the flexibility to learn a similarity metric between their latent-space encoded features, whereas the earlier two-channel-input CNN architectures~\mbox{\cite{Cheng2018DeepSL}},~\mbox{\cite{Simonovsky2016ADM}} only allow a single latent-space representation for the pair of input images. Unlike DeepSim's CNN autoencoder, Siamese networks provide a more efficient way of extracting features from similar input images through their weight-sharing property~\mbox{\cite{Bromley1993SignatureVU}}. Table~{\ref{tab:multimodal-metrics}} and Table~{\ref{tab:unsupervised-metrics}} show that the dual-encoder architecture of SiamRegQC yields competitive registration performance for multimodal and unimodal datasets.

\subsection{Rationale for using a final classification task instead of regression task}
Training a supervised DL network aims to learn meaningful semantic similarities and provide a numerical quality measure for the registration between a pair of images. We exploit the intermediate Siamese encodings to measure a numerical cosine similarity value as the similarity between the input target and the moving images. 
Since the autoencoding regression task, which trains only the encoders, already learns semantic representations of a single input image in the first training stage, we use a second training stage to provide the context of aligned and misaligned input pairs. The classification task orients the framework toward the categorical (aligned or misaligned) decision, which increases the sensitivity of SiamRegQC to misalignment errors and minimizes the risk of losing essential misalignment information. Adding a contrastive loss (which increases discrimination between aligned and misaligned pairs) that leverages these categorical labels is more straightforward in a classification framework than in a regression framework. Also, the self-supervised classification task conceptually aligns with the previous works of~\mbox{\cite{Cheng2018DeepSL}},~\mbox{\cite{Simonovsky2016ADM}}. Hence, we opt for a classification task to train SiamRegQC end-to-end in the second training step.

\section{Desirable aspects of SiamRegQC for critical registration evaluation (Continued Section~3.1)}
\label{appendix-desirable}
In this section, we continue our analysis of SiamRegQC as a sensitive and critical metric with some qualitative examples and registration algorithms for the section-wise registration of histological sections.
\\[1pt]
\textbf{Consistency of Evaluation Metric}- In Section~\ref{sec:desirable-aspects}, we have seen that SiamRegQC can distinguish between visibly misaligned images with increased sensitivity, as also seen in Columns 1 and 2 of Figure~\ref{fig:cases}. This section discusses another important aspect of an evaluation metric: maintaining a consistent value for visibly well-registered images. From Columns 4 and 5 of Figure~\ref{fig:cases}, we see that MSE and SSIM have largely different (inconsistent) numerical values for visibly well-registered images, while SiamRegQC evaluates them consistently. Note that the metric values for MSE and SiamRegQC (except in the last column) are scaled by dividing them by the value for the smallest translation misalignment error of 0.001~mm for better interpretation.
% \label{sec:3}    
\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/New_exp_in_scaled.eps}
    \caption{Evaluation of SITK-registered images starting from random initial misalignment errors to demonstrate the consistency of the SiamRegQC metric. Columns 1~to~3: Registered images in decreasing order of misalignment errors as seen by visual inspection. Columns 4 and 5: Visually well-registered images show inconsistent MSE and SSIM metrics, while SiamRegQC shows a consistent metric value. Last column: Ground truth registration solution for a given target image.}
    \label{fig:cases}
\end{figure}

\textbf{Statistical Benchmarking Performance of 2D Rigid Registration Algorithms of Nissl-stained Histological Volume Reconstruction-}
\label{BenchMark_Reg}
In 3D histological volume reconstruction, the reconstructed volume is formed by successively registering neighboring 2D serial sections to one another \cite{8575935}. In this setting, even small sub-pixel registration errors in the successive 2D section-wise registrations can accumulate over a number of sections, possibly resulting in a skewed or distorted 3D volume \cite{Lobachev2021EvaluatingRO}. While MSE and SSIM tend to overlook such small errors, SiamRegQC can pick up on them with increased sensitivity, as seen in Figure~\ref{fig:figure_loss}. We consider any pair of consecutive coronal sections of the Nissl-stained mouse brain dataset as a ground-truth pair of sections (refer to Appendix~{\ref{appendix:stack-alignment}} for more details). We benchmark three registration algorithms by comparing their results to the original pair of sections with the Welch two-sample t-test. A registration algorithm can be termed ``good'' when the mean evaluation metric of the registered pairs of sections is as close as possible to the mean metric value of the ground-truth pairs of sections. Statistically, the p-value associated with the Welch test helps in assigning a confidence value to the ``goodness'' of registration. A p-value less than 0.05 suggests that the results of the registration algorithm are significantly different from the ground-truth images, indicating low confidence in evaluating the registration algorithm as ``good.'' Alternatively, a p-value greater than 0.05 means that no significant difference from the ground truth is detected, so the registration algorithm can be considered ``good'' with more confidence. From Table~{\ref{tab:benchmark}}, the relatively poorly performing FFT \mbox{\cite{Reddy1996AnFT}} method is a trivial case of evaluation, which even the traditional metrics adequately flag with larger mean differences from the ground truth, p-values $\ll$ 0.05, and high absolute $T_{stat}$ values of 15.4 for MSE and 10.1 for SSIM, respectively.
The overall well-performing SITK \mbox{\cite{Yaniv2017SimpleITKIN}} and SIFT \mbox{\cite{Lowe2004DistinctiveIF}} algorithms are non-trivial cases of evaluation. Especially for the SITK algorithm, while MSE, SSIM, and SiamRegQC$_{np\_nl}$ indicate that the registration outputs are ``good'' with a higher confidence score (p-value greater than 0.05; highlighted in red for MSE and SSIM), the other variants of SiamRegQC remain critical of SITK being a ``good'' algorithm (p-value less than 0.05, highlighted in blue). This shows that SiamRegQC treats cases such as the one shown in Figure~\ref{fig:cases}, Column 1, with high criticality and hence reports a conservative p-value confidence about the ``goodness'' of the SITK algorithm. These benefits of SiamRegQC are further explored in Section~\ref{sec:multimodal}, where SiamRegQC shows better registration performance than previous deep similarity metrics.
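
For reference, the Welch statistic reported in Table~\ref{tab:benchmark} can be sketched as follows. The symbols $\mu_r, \sigma_r, n_r$ (mean, standard deviation, and number of registered pairs) and $\mu_g, \sigma_g, n_g$ (the same quantities for the ground-truth pairs) are notation assumed here for illustration, and the sample sizes are not restated in this section:
\begin{equation*}
    T_{stat} = \frac{\mu_r - \mu_g}{\sqrt{\dfrac{\sigma_r^{2}}{n_r} + \dfrac{\sigma_g^{2}}{n_g}}},
    \qquad
    \nu \approx \frac{\left(\dfrac{\sigma_r^{2}}{n_r} + \dfrac{\sigma_g^{2}}{n_g}\right)^{2}}{\dfrac{\left(\sigma_r^{2}/n_r\right)^{2}}{n_r - 1} + \dfrac{\left(\sigma_g^{2}/n_g\right)^{2}}{n_g - 1}}.
\end{equation*}
Assuming a two-sided test, the p-value is the two-sided tail probability of Student's $t$ distribution with $\nu$ degrees of freedom evaluated at $T_{stat}$. With the sign convention above, which matches the signs in the table, a positive $T_{stat}$ indicates a higher mean metric value for the registered pairs than for the ground-truth pairs.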

\begin{table}[t!]
\scriptsize
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccc}
\hline
\multicolumn{2}{c}{}                                        & \multicolumn{4}{c}{Registration Algorithm}                                                 \\ \cline{3-6} 
\multicolumn{2}{c}{\multirow{-2}{*}{Evaluation Metric}}     & Ground Truth      & SIFT              & SITK                           & FFT               \\ \hline
                                         & $\mu \pm \sigma$ & 0.001 $\pm$ 0.001 & 0.002 $\pm$ 0.001 & 0.001 $\pm$ 0.001              & 0.007 $\pm$ 0.005 \\
                                         & $T_{stat}$       & -                 & 3.7               & {\color[HTML]{FE0000} 0.1}     & 15.4              \\
\multirow{-3}{*}{MSE}                    & $p_{val}$        & -                 & 2.6e-04           & {\color[HTML]{FE0000} 0.91}    & 1e-36             \\ \hline
                                         & $\mu \pm \sigma$ & 0.896 $\pm$ 0.060 & 0.860 $\pm$ 0.071 & 0.890 $\pm$ 0.066              & 0.782 $\pm$ 0.099 \\
                                         & $T_{stat}$       & -                 & -3.9              & {\color[HTML]{FE0000} -0.65}   & -10.1             \\
\multirow{-3}{*}{SSIM}                   & $p_{val}$        & -                 & 1e-04             & {\color[HTML]{FE0000} 0.52}    & 4.9e-19           \\ \hline
                                         & $\mu \pm \sigma$ & 0.004 $\pm$ 0.005 & 0.006 $\pm$ 0.006 & 0.005 $\pm$ 0.005              & 0.114 $\pm$ 0.105 \\
                                         & $T_{stat}$       & -                 & 4.9               & 1.2                            & 15.0              \\
\multirow{-3}{*}{SiamRegQC$_{np\_nl}$} & $p_{val}$        & -                 & 1.7e-06           & 0.12                           & 3.4e-35           \\ \hline
                                         & $\mu \pm \sigma$ & 0.010 $\pm$ 0.010 & 0.015 $\pm$ 0.010 & 0.012 $\pm$ 0.010              & 0.163 $\pm$ 0.127 \\
                                         & $T_{stat}$       & -                 & 5.9               & 2.3                            & 17.3              \\
\multirow{-3}{*}{SiamRegQC$_{np}$} & $p_{val}$ & - & 7.9e-09 & {\color[HTML]{3531FF} 0.02}    & 3.5e-42 \\ \hline
                                         & $\mu \pm \sigma$ & 0.017 $\pm$ 0.014 & 0.029 $\pm$ 0.017 & 0.023 $\pm$ 0.013              & 0.209 $\pm$ 0.133 \\
                                         & $T_{stat}$       & -                 & 7.3               & 3.4                            & 20.1              \\
\multirow{-3}{*}{SiamRegQC$_{nl}$ }                & $p_{val}$        & -                 & 2.7e-12           & {\color[HTML]{3531FF} 2.2e-04} & 3.2e-53           \\ \hline
                                         & $\mu \pm \sigma$ & 0.024 $\pm$ 0.017 & 0.039 $\pm$ 0.023 & 0.031 $\pm$ 0.021              & 0.264 $\pm$ 0.145 \\
                                         & $T_{stat}$       & -                 & 7.2               & 4.0                            & 23.7              \\
\multirow{-3}{*}{SiamRegQC}          & $p_{val}$ & - & 1.9e-12 & {\color[HTML]{3531FF} 7.6e-05} & 7.9e-62 \\ \hline
\end{tabular}%
}  
\caption{Welch two-sample t-test between registration algorithms and ground truth images. MSE and SSIM rate SITK as a ``good'' registration method with a p-value greater than 0.05 (highlighted in red), while SiamRegQC detects a statistically significant difference from the ground truth with a p-value less than 0.05 (highlighted in blue).}
\label{tab:benchmark}
\end{table}

\section{Evaluation of Section-wise Registration of Nissl Mouse Brain Volume Reconstruction}
\label{appendix:stack-alignment}
The process of histological volume reconstruction with section-wise 2D registrations is detailed and traditionally evaluated with metrics like MSE and SSIM, as shown by~\cite{Lobachev2021EvaluatingRO}. The section-wise registration process can be summarized as follows.
Consider a set of $n$ serial section histology images, $\mathrm{S_{1}}$, $\mathrm{S_{2}}$, \ldots, $\mathrm{S_{n}}$, to be aligned to form a 3D volume, $\mathrm{V}$. We begin with the middle section of the histological stack, $\mathrm{S_{n//2}}$, as a reference and perform pairwise registrations of serial sections as given below:
\begin{equation}
\label{eq:recon}
    \mathrm{Aligned\;S_{i} =} 
\begin{cases}
    \mathrm{T(S_{i}, S_{i-1})}, & \text{if } i > n//2\\
    \mathrm{T(S_{i}, S_{i+1})}, & \text{if } i < n//2\\
\end{cases}            
\end{equation}

where, 
$\mathrm{T}$ is the rigid transform that registers $\mathrm{S_{i}}$ to $\mathrm{S_{i\pm1}}$.
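
To make the role of $\mathrm{T}$ concrete, a 2D rigid transform can be parameterized by a rotation angle $\theta$ and a translation $(t_x, t_y)$; these symbols are assumed here for illustration. Under this parameterization, a pixel coordinate $(x, y)$ is mapped to
\begin{equation*}
    \begin{pmatrix} x' \\ y' \end{pmatrix}
    =
    \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
    \begin{pmatrix} x \\ y \end{pmatrix}
    +
    \begin{pmatrix} t_x \\ t_y \end{pmatrix},
\end{equation*}
so each pairwise registration estimates only three parameters.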
\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/Nissl Reconstruction.png}
    \caption{Ground truth pairwise sections extracted from the original 3D Nissl-stained volume. The sections are synthetically misaligned and registered back using three different algorithms: SIFT, SITK, and FFT. SiamRegQC is used to benchmark the algorithms critically.}
    \label{fig:nissl-gt}
\end{figure}

In Section~\ref{BenchMark_Reg} and Table~\ref{tab:benchmark}, pairs ($\mathrm{S_{i}}$, $\mathrm{S_{i \pm 1}}$) from the Nissl dataset are considered as the ground-truth sections. We synthetically misalign successive serial sections as described in Figure~\ref{fig:nissl-gt} and use three different registration algorithms, viz., feature-based (SIFT) \cite{Lowe2004DistinctiveIF}, intensity-based (SITK) \cite{Yaniv2017SimpleITKIN}, and FFT-based registration \cite{Reddy1996AnFT}, to register them back together. The experimental results and performances of the mentioned algorithms are detailed in Section~\ref{BenchMark_Reg}.

\section{SiamRegQC as a deep similarity metric for unsupervised non-rigid deformable registration for unimodal images}
\label{appendix:nonrigid deformation}
In this section, we evaluate the effectiveness of SiamRegQC on a more complex downstream task of non-rigid deformable registration. We first describe the process of generating synthetic non-rigid misaligned images for training SiamRegQC. Later, we study the effect of using SiamRegQC as a similarity metric to drive unsupervised non-rigid registration in comparison with previous deep similarity metrics for unimodal data.

\subsection{Generation of Non-Rigid Deformable Transformations}
To simulate non-rigid misaligned images for training SiamRegQC, we generate random flow-field grids and resample the input image with bilinear interpolation to get non-rigid deformed images, as shown in Figure~\ref{fig:def-flowchart}.
The generated random flow fields are scaled and smoothed with randomly chosen parameters $\alpha$ and $\sigma$, respectively. We train SiamRegQC with original and synthetically created misaligned pairs of images as described in Section~\ref{sec:methodology}. We use about 3000 T1 MRI sections from the same IXI dataset described in Section~\ref{dataset} for training SiamRegQC with non-rigid misaligned and aligned images.
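
One common way to realize such a deformation, given here as a sketch under stated assumptions rather than as the exact procedure used, is to draw a random displacement field $\mathbf{r}$, smooth it with a Gaussian kernel $G_{\sigma}$ (the choice of a Gaussian kernel is an assumption), and scale it by $\alpha$:
\begin{equation*}
    \mathbf{u}(\mathbf{x}) = \alpha \, (G_{\sigma} * \mathbf{r})(\mathbf{x}),
    \qquad
    I_{\mathrm{def}}(\mathbf{x}) = I\big(\mathbf{x} + \mathbf{u}(\mathbf{x})\big),
\end{equation*}
where $I$ is the input image, $*$ denotes convolution, and the values of $I$ at non-integer locations are obtained with bilinear interpolation. The symbols $\mathbf{u}$, $\mathbf{r}$, and $I_{\mathrm{def}}$ are introduced here for illustration only. A larger $\alpha$ increases the magnitude of the misalignment, while a larger $\sigma$ yields a smoother deformation.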

\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/deformable flowchart (1).png}
    \caption{Left: Graphical representation of synthetically generating non-rigid misaligned images. $\alpha$ and $\sigma$ denote the parameters used to control the scale and smoothness of the deformed (misaligned) grid. Right: Examples of generated non-rigid misaligned images.}
    \label{fig:def-flowchart}
\end{figure}

\subsection{Comparison with previous Deep Similarity Metrics}
To test the effectiveness of SiamRegQC beyond rigid registration and quality evaluation as described in Section~{\ref{sec:experiments}}, we study the effects of leveraging SiamRegQC as a similarity metric (interchangeably referred to as a loss function) to drive unsupervised non-rigid deformable registration. We use a VoxelMorph~\mbox{\cite{balakrishnan2019voxelmorph}} architecture to set up a learning-based unsupervised deformable registration framework, as shown in Figure~{\ref{fig:registration framework}}. We compare SiamRegQC with other deep similarity metrics, namely DeepSim~\mbox{\cite{Czolbe2023SemanticSM}} and the two-channel CNN-based metric proposed by~\mbox{\cite{Cheng2018DeepSL}}. More details on these metrics and their methodological differences from SiamRegQC are discussed in Appendices~{\ref{appendix:lit-survey}} and~{\ref{appendix:rationale}}.
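
In the standard VoxelMorph formulation, the network predicts a deformation field $\phi$ and is trained by minimizing a similarity term plus a smoothness regularizer. A sketch of how a similarity metric enters this objective is given below, with $f$ and $m$ denoting the target (fixed) and moving images and $\lambda$ a regularization weight; these symbols are assumed here, and the value of $\lambda$ and the exact regularizer are not stated in this section:
\begin{equation*}
    \mathcal{L}(f, m, \phi) = \mathcal{L}_{sim}\big(f, \, m \circ \phi\big) + \lambda \, \mathcal{L}_{smooth}(\phi).
\end{equation*}
Substituting MSE, NCC, the metric of~\mbox{\cite{Cheng2018DeepSL}}, DeepSim, or SiamRegQC for $\mathcal{L}_{sim}$ gives the variants $\mathrm{VXM_{mse}}$, $\mathrm{VXM_{ncc}}$, $\mathrm{VXM_{Cheng}}$, $\mathrm{VXM_{DeepSim}}$, and $\mathrm{VXM_{SiamRegQC}}$ compared in Table~{\ref{tab:unsupervised-metrics}}.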

\begin{table}[t!]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccccc}
\hline
Evaluation Metric &
  \begin{tabular}[c]{@{}c@{}}Deformed Image\\ (Before Registration)\end{tabular} &
  $\mathrm{VXM_{mse}}$ &
  $\mathrm{VXM_{ncc}}$ &
  $\mathrm{VXM_{Cheng}}$ &
  $\mathrm{VXM_{DeepSim}}$ &
  $\mathrm{VXM_{SiamRegQC}}$ &
  ANTsPy \\ \hline
MSE &
  $0.016 \pm 0.017$ &
  $0.0027 \pm 0.007$ &
  $0.0013 \pm 0.006$ &
  $0.0024 \pm 0.007$ &
  $0.0023 \pm 0.006$ &
  {\color[HTML]{3531FF} $0.0011 \pm 0.005$} &
  {\color[HTML]{3531FF} $0.0010 \pm 0.004$} \\ \hline
SSIM &
  $0.699 \pm 0.134$ &
  $0.935 \pm 0.127$ &
  $0.962 \pm 0.123$ &
  $0.920 \pm 0.124$ &
  $0.936 \pm 0.127$ &
  {\color[HTML]{3531FF} $0.967 \pm 0.109$} &
  {\color[HTML]{3531FF} $0.987 \pm 0.031$} \\ \hline
\end{tabular}%
}
\caption{Quantitative Evaluation of SiamRegQC before and after registration with other traditional and deep similarity metrics. Considering the non-learning-based ANTsPy as a reference registration method, SiamRegQC shows the closest performance to ANTsPy (highlighted in blue).}
\label{tab:unsupervised-metrics}
\end{table}
\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/scatter_plots.pdf}
    \caption{(Top row) Improvement in MSE metrics after using SiamRegQC as a similarity-based loss function for unsupervised deformable registration. Each data point represents the evaluation metric for one pair of registered and target images. The dashed line represents identity transformation. All data points below the dashed line suggest improvement in the registration performance. (Bottom row) Improvement in SSIM metrics after using SiamRegQC as a similarity-based loss function for unsupervised deformable registration. All data points above the dashed line suggest improvement in the registration performance. Each registration method name is denoted as VXM, with the subscript indicating the type of similarity metric used as the loss function. For instance, $\mathrm{VXM_{mse}}$ denotes VoxelMorph method with MSE as the loss function.}
    \label{fig:scatterplot-T1}
\end{figure}
\begin{figure}[t!]
    \centering
    \includegraphics[width=1.\linewidth]{Plots/Copy of deformable_images (1).png}
    \caption{Registered images from different VoxelMorph networks supervised with various traditional and deep-similarity-based loss functions. Except for the input target and moving images, the VoxelMorph outputs are displayed as differences from the target images for better visualization of the registration error.}
    \label{fig:deformable_images_registered}
\end{figure}
Table~{\ref{tab:unsupervised-metrics}} shows that the VoxelMorph network trained with SiamRegQC performs better than those trained with other traditional and deep similarity metrics, with a mean SSIM value of 0.967 (highlighted in blue), closely followed by the network optimized with Normalized Cross Correlation (NCC). In Figure~{\ref{fig:scatterplot-T1}}, ANTsPy shows the least dispersion, with data points close to 0 for MSE and close to 1 for SSIM. Among the learning-based VoxelMorph networks trained with different loss terms, SiamRegQC shows the dispersion closest to that of ANTsPy.
Although the non-learning-based traditional ANTsPy \mbox{\cite{avants2009advanced}} registration method shows the best MSE and SSIM metrics, it has an average inference time of 3 minutes per registration run, whereas $\mathrm{VXM_{SiamRegQC}}$ records an average inference time of 0.18 seconds for a single registration run. The qualitative results supporting SiamRegQC's superior performance are shown in Figure~{\ref{fig:deformable_images_registered}}.
\end{document}