\section{Inter-subject registration on the Ultracortex dataset}
\label{sec:ultracortex}

The Ultracortex dataset~\citep{ultracortex} is a collection of ultra-high-field MRI images of the human brain acquired on a 9.4T scanner, with resolutions ranging from 0.6mm to 0.8mm. The data comprise both MP-RAGE and MP2RAGE sequences.
The dataset includes high-quality manual segmentations for 12 subjects, covering gray and white matter in each hemisphere, for a total of 4 labels.

The LUMIR challenge uses SLANT to obtain labelmaps for the Ultracortex dataset, and downsamples the images to 1mm isotropic resolution.
However, this preprocessing has two undesirable effects.
\textit{First}, submillimeter-resolution images can provide additional cytoarchitectural detail and act as a bridge between low-resolution \textit{in-vivo} scans and high-resolution histology images.
Lowering the resolution can lead to loss of this information and defeats the purpose of using high-resolution scans in the first place.
Moreover, this is not representative of clinical and research workflows where high-resolution blockface scans are used as an intermediate modality between in-vivo scans and histology slides~\citep{nextbrain,alegro2016multimodal}.
\textit{Second}, the MP2RAGE sequences in the dataset are both qualitatively and quantitatively different from the MP-RAGE sequences seen in the OASIS or LUMIR datasets.
This constitutes a significant source of domain shift that leads to poor performance of SLANT on the Ultracortex dataset, making it unsuitable for robust evaluation.
This aspect is not discussed and possibly unaccounted for in the original evaluation.
We examine the volumes and intensity histograms of the subjects in \autoref{fig:ultracortex_vis_hist}, which shows that the MP-RAGE sequences (corresponding to subjects \texttt{sub-37}, \texttt{sub-45}, and \texttt{sub-57}) indeed look qualitatively different from the MP2RAGE sequences.
Specifically, histograms of the MP2RAGE sequences are characterized by two or three peaks, close to the extreme values of the intensity range, while the MP-RAGE sequences have a more unimodal distribution with a single dominant peak.

\begin{figure}[h!]
    \centering
    \includegraphics[width=0.48\linewidth]{figures/ultracortex-vis.pdf}
    \includegraphics[width=0.51\linewidth]{figures/ultracortex-hist.pdf}
    \caption{\footnotesize \textbf{Multimodal characterization of the Ultracortex dataset.} \textbf{Left} shows axial slices of subjects from the Ultracortex dataset. Out of 12 subjects with labeled segmentations, 3 subjects have MP-RAGE sequence data, and 9 subjects have MP2RAGE sequence data. \textbf{Right} shows histograms of the intensity values of the subjects. The MP2RAGE sequences are characterized by two or three peaks close to the extreme values of the intensity range, while the MP-RAGE sequences have a more unimodal distribution with a single dominant peak. The qualitative differences in both the intensity values and histograms are indicative of the multimodal nature of the dataset.}
    \label{fig:ultracortex_vis_hist}
\end{figure}

\textbf{Evaluation}.
To account for the possible effect of domain shift on the performance of SLANT, we perform an alternative evaluation that leverages the high-quality manual segmentations already provided as part of the dataset.
The MP2RAGE sequences in the dataset provide excellent gray/white matter contrast, making them a practical testbed for evaluating the performance of registration algorithms.
\textbf{Resolution:} First, we affinely register all images to the \texttt{sub-3} subject's MP2RAGE image. This brings all images to a 0.6mm isotropic resolution.
Initially, we intended to evaluate both the Greedy and SyN modes of FireANTs, as well as VFA, on the 0.6mm isotropic resolution images.
Unfortunately, VFA runs out of memory for 0.6mm isotropic registration on a GPU with 48GB of memory, highlighting the limitations of deep learning methods for high-resolution image registration.
Therefore, we further resample the dataset and labels to 1mm isotropic resolution, and evaluate the performance of the same methods on the 1mm isotropic resolution images.
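As a rough indication of the cost of the higher resolution, assuming a fixed field of view, going from 1mm to 0.6mm isotropic spacing increases the number of voxels by a factor of
\[
    \left(\frac{1\,\mathrm{mm}}{0.6\,\mathrm{mm}}\right)^{3} \approx 4.63,
\]
so any method whose memory footprint grows at least linearly with the number of voxels would need roughly that much more memory per volume. This is a back-of-the-envelope estimate under the stated assumption, not a measured memory profile of any method evaluated here.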
% 
\textbf{Multimodality:} Moreover, in the dataset, 3 out of the 12 subjects have MP-RAGE sequences, while the other 9 subjects have MP2RAGE sequences.
Registering an MP-RAGE image to an MP2RAGE image constitutes a multimodal task, so for FireANTs we use MIND features for every pair of images with different modalities.
VFA does not support input features other than image intensities, so its evaluation remains unchanged.
We report Dice scores on four splits of the dataset: (a) MP-RAGE to MP-RAGE ($n = 6$), (b) MP2RAGE to MP2RAGE ($n = 72$), (c) MP-RAGE $\leftrightarrow$ MP2RAGE ($n = 54$), and (d) all subjects ($n = 132$).
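These split sizes correspond to counting ordered (fixed, moving) pairs of distinct subjects among the 3 MP-RAGE and 9 MP2RAGE subjects, assuming each subject is registered to every other subject in both directions:
\[
    \underbrace{3 \cdot 2}_{\mathrm{(a)}} = 6, \qquad
    \underbrace{9 \cdot 8}_{\mathrm{(b)}} = 72, \qquad
    \underbrace{2 \cdot 3 \cdot 9}_{\mathrm{(c)}} = 54, \qquad
    \underbrace{12 \cdot 11}_{\mathrm{(d)}} = 132 = 6 + 72 + 54 .
\]
For reference, and assuming the standard definition, the Dice score for label $k$ between a warped moving segmentation $S_m^{k}$ and a fixed segmentation $S_f^{k}$ is
\[
    \mathrm{Dice}_k = \frac{2\,|S_m^{k} \cap S_f^{k}|}{|S_m^{k}| + |S_f^{k}|},
\]
where $|\cdot|$ denotes the number of voxels; the symbols $S_m^{k}$ and $S_f^{k}$ are introduced here for illustration only.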

\begin{figure}
    \centering
    \includegraphics[width=0.48\linewidth]{figures/ultracortex_mpra_mpra.pdf}
    \includegraphics[width=0.48\linewidth]{figures/ultracortex_mp2ra_mp2ra.pdf}
    \includegraphics[width=0.48\linewidth]{figures/ultracortex_mpra_mp2ra.pdf}
    \includegraphics[width=0.48\linewidth]{figures/ultracortex_all_splits.pdf}
    \caption{\small Comparison of the three registration methods on the Ultracortex dataset.}
    \label{fig:ultracortex_registration_results}
\end{figure}

\begin{table}[htbp]
    \centering
    \caption{\small Registration performance (Dice score) on the Ultracortex dataset across splits and methods.}
    \label{tab:ultracortex_registration_results}
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{l*{5}{c}}
    \toprule
    Split & Greedy (0.6mm) & SyN (0.6mm) & Greedy (1mm) & SyN (1mm) & VFA (1mm) \\
    \midrule
    All splits & $0.784 \pm 0.096$ & $0.794 \pm 0.073$ & $0.768 \pm 0.080$ & $0.769 \pm 0.071$ & $0.633 \pm 0.100$ \\
    MP-RAGE to MP-RAGE & $0.804 \pm 0.017$ & $0.803 \pm 0.015$ & $0.788 \pm 0.016$ & $0.787 \pm 0.015$ & $0.648 \pm 0.037$ \\
    MP-RAGE $\leftrightarrow$ MP2RAGE & $0.679 \pm 0.060$ & $0.710 \pm 0.026$ & $0.674 \pm 0.014$ & $0.685 \pm 0.015$ & $0.535 \pm 0.065$ \\
    MP2RAGE to MP2RAGE & $0.860 \pm 0.012$ & $0.857 \pm 0.010$ & $0.836 \pm 0.010$ & $0.831 \pm 0.009$ & $0.706 \pm 0.051$ \\
    \bottomrule
    \end{tabular}
    }
\end{table}

\textbf{Results}.
The results in \autoref{tab:ultracortex_registration_results} and \autoref{fig:ultracortex_registration_results} highlight three key insights.
First, MP-RAGE to MP2RAGE registration is a significantly harder task than either MP-RAGE to MP-RAGE or MP2RAGE to MP2RAGE registration, as illustrated by a drop of roughly 15 to 18 Dice points relative to the MP2RAGE to MP2RAGE split across all methods.
The MP2RAGE images are well suited to registering gray and white matter boundaries due to their excellent contrast, reaching an average Dice score of up to 0.86 for the Greedy version of FireANTs.
Second, high-resolution registration leads to an increase of around 2 Dice points for both the Greedy and SyN versions of FireANTs, essentially obtained for `free' without any additional domain-specific considerations.
Third, the results show that, beyond the poor generalization of a representative top-performing deep learning method to out-of-distribution contrasts, such methods cannot accommodate multimodal images out of the box.
Furthermore, these methods do not scale beyond 1mm isotropic resolution, limiting their applicability to the broad range of high-resolution images and the insights provided by advanced high-resolution scanners, ex-vivo imaging studies, and multimodal integration.
Iterative optimization methods, which with improved efficiency can register 0.6mm isotropic images in seconds, are well positioned to tackle the scale of high-resolution imaging pipelines such as MRI-to-histology workflows.
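
The point differences quoted above can be read off \autoref{tab:ultracortex_registration_results} directly. Subtracting the cross-modal mean Dice from the MP2RAGE to MP2RAGE mean Dice in each column gives
\[
    \begin{array}{lccc}
    \mbox{Method} & \mbox{MP2RAGE to MP2RAGE} & \mbox{Cross-modal} & \mbox{Difference} \\
    \hline
    \mbox{Greedy (0.6mm)} & 0.860 & 0.679 & 0.181 \\
    \mbox{SyN (0.6mm)}    & 0.857 & 0.710 & 0.147 \\
    \mbox{Greedy (1mm)}   & 0.836 & 0.674 & 0.162 \\
    \mbox{SyN (1mm)}      & 0.831 & 0.685 & 0.146 \\
    \mbox{VFA (1mm)}      & 0.706 & 0.535 & 0.171
    \end{array}
\]
Similarly, on the split containing all pairs, the gain from registering at 0.6mm rather than 1mm is $0.784 - 0.768 = 0.016$ for Greedy and $0.794 - 0.769 = 0.025$ for SyN. These are simple differences of the reported means and do not account for the spread reported alongside each entry.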


