\section{Inter-subject registration on the NIMH T1w dataset}
\label{sec:nimh_t1}

The National Institute of Mental Health (NIMH) Data Archive makes available human subject data collected from hundreds of research projects across many scientific domains.
We use the Research Volunteer Dataset, which characterizes healthy adult research volunteers through clinical assessments using mood-related psychometrics, cognitive function neuropsychological tests, structural and functional MRI, DTI, and MEG.
We use a subset of the T1w MRI dataset for inter-subject registration.

T1w MRI provides excellent gray/white matter contrast and is routinely used for structural segmentation and morphometry.
Like most prior work, we use the overlap of cortical and subcortical structures for evaluation.
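For reference, with $A$ and $B$ denoting the sets of voxels assigned to a given structure in the warped moving and fixed labelmaps (notation introduced here only for illustration), the standard Dice overlap is
\[
  \mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|},
\]
which ranges from 0 (no overlap) to 1 (perfect overlap).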
The LUMIR challenge's evaluation protocol uses SLANT~\citep{slant}, a deep learning tool that segments a T1w MRI scan into 133 labels based on the BrainCOLOR protocol~\citep{braincolor}.
However, there are a few issues with using SLANT for evaluation. 
% too many labels
\textit{First}, SLANT produces 133 labels but is trained on only 45 T1-weighted MRI scans from the OASIS dataset.
This can severely limit generalization to other modalities such as T2w, T2*, FLAIR, and ultra-high-field (UHF) MRI.
While label fusion can help, systematic bias in the multi-atlas fusion step can propagate into the learned UNet. 
We observe degradation in performance of the SLANT algorithm on the Ultracortex dataset. % (see \autoref{sec:ultracortex}).
\textit{Second}, measures like Dice scores are highly sensitive to the volume of the structures, and consequently the choice of label interpolation method can significantly affect the score.
For example, in our experiments, changing the interpolation method from trilinear to nearest neighbor in VFA leads to a drop of about 10 points in Dice score on the SLANT labelmaps.
To ensure a fair comparison, we fix the label interpolation scheme: we convert each labelmap into one binary mask per label, apply trilinear interpolation to each mask to obtain a probability map for each label, and assign each voxel the label with the highest probability.
This scheme avoids the blocky artifacts introduced by nearest-neighbor interpolation, accounts for partial-volume effects through the probability maps, and assigns a single label to each voxel.
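As a sketch of this scheme in symbols (the symbols $M$, $\varphi$, and $\mathcal{T}$ are introduced here only for illustration, with $M$ the moving labelmap, $\varphi$ the estimated transform, and $\mathcal{T}$ trilinear resampling), the warped label at voxel $x$ can be written as
\[
  \hat{M}(x) = \arg\max_{k} \; \mathcal{T}\big[\mathbf{1}\{M = k\}\big]\big(\varphi(x)\big),
\]
where $\mathbf{1}\{M = k\}$ is the binary mask of label $k$ and $\mathcal{T}[\cdot](\varphi(x))$ is its trilinearly interpolated value at the mapped location, interpreted as the probability of label $k$ at $x$.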
\textit{Third}, our previous work~\citep{magicormirage} shows that the mutual information between images and label maps is correlated with the Dice score of registration.
The BrainCOLOR protocol used in SLANT defines many fine-grained structures, including sulcal/gyral boundaries and various lobe boundaries, which cannot be delineated by intensity features alone.
Since we are interested in evaluating intensity-based registration methods, Dice scores on such labels can capture spurious associations rather than anatomical relationships that can be delineated by intensity features.

\textbf{Evaluation}. To address these issues, we choose three labelling protocols with varying degrees of granularity and anatomical coverage: (1) SLANT, as used in the original LUMIR challenge, (2) SynthSeg~\citep{synthseg}, which comprehensively segments subcortical structures while treating the cerebral cortex as a single label per hemisphere, and (3) DeepAtropos~\citep{deepatropos}, which provides a coarse six-label segmentation of CSF, GM, WM, deep GM, brainstem, and cerebellum.
We randomly choose 100 subjects from the dataset, resample their images to 1\,mm isotropic resolution, and apply all three segmentation protocols to obtain labelmaps.
This yields a total of $100 \times 99 = 9900$ ordered image pairs for evaluation.
To provide robust estimates, we discard the lowest five percent of the Dice scores for each registration method.
% Non-robust estimates are in the Appendix, and do not change the conclusions of the paper.
We provide common statistical measures (mean, median, standard deviation) for the Dice scores of the three registration methods in \autoref{tab:t1_nimh_table}, and violin plots in \autoref{fig:nimh_t1_labelmaps}.

\textbf{Significance Tests}.
To evaluate the practical impact of the differences in labelling protocols, we perform a paired $t$-test and a Wilcoxon signed-rank test between the Dice scores of the three registration methods.
With such a large sample size, statistical significance ($p < 0.05$) is almost guaranteed for any nonzero difference, and we observed $p$-values lower than $10^{-4}$ for all method pairs.
To report a more informative measure of statistical significance, we take inspiration from \citet{klein2009evaluation} and perform permutation tests~\citep{menke2004using} to determine whether the means of a small set of independent overlap values obtained by each registration method are the same.
Each subset of brain pairs is selected so that each brain is used only once; we fix the number of permutations to 1024 and compute 10,000 $p$-values for each method pair.
We report the fraction of $p$-values below 0.05 (denoted $\mu$) for each method pair as a proxy for statistical significance, as suggested by \citet{klein2009evaluation}, with higher values indicating greater statistical significance.
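In symbols, with $p_i$ the $p$-value of the $i$-th permutation test and $N = 10{,}000$ (notation introduced here only for illustration),
\[
  \mu = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{p_i < 0.05\},
\]
so that $\mu \in [0, 1]$, and $\mu$ close to 1 means that nearly every independent subset yields a significant difference.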
To measure practical impact, we compute Cohen's $d$, a measure of effect size, for each pair of registration methods, and treat $d > 0.2$ as practically significant.
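As a sketch, assuming the common pooled-standard-deviation form for two methods $a$ and $b$ evaluated on the same number of image pairs (a paired variant based on the standard deviation of per-pair differences is an equally plausible choice),
\[
  d = \frac{\bar{x}_a - \bar{x}_b}{s_{\mathrm{pooled}}}, \qquad
  s_{\mathrm{pooled}} = \sqrt{\frac{s_a^2 + s_b^2}{2}},
\]
where $\bar{x}$ and $s$ denote the mean and standard deviation of the Dice scores of each method. By the usual convention, $|d| \approx 0.2$, $0.5$, and $0.8$ correspond to small, medium, and large effects.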

\begin{figure}[h!]
    \centering
    \includegraphics[width=0.55\linewidth]{figures/nimh_t1.pdf} \hfill
    \includegraphics[width=0.44\linewidth]{figures/nihm_t1_cohens_d.pdf}
    \caption{\small \textbf{Comparison of the three registration methods on the NIMH T1w dataset.} \textbf{Left} shows violin plots of the Dice scores of the top iterative and deep learning registration methods on the NIMH T1w dataset. \textbf{Right} shows Cohen's d scores for all method pairs, quantifying the practical significance of the differences in Dice scores between the three registration methods.}
    \label{fig:nimh_t1_labelmaps}
    % \caption{Comparison of the three labelling protocols on T1w MRI scans.}
\end{figure}

\begin{table}
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{l|ccc|ccc|ccc}
        \hline
        \multirow{2}{*}{\textbf{Method}} & \multicolumn{3}{c|}{\textbf{SLANT}} & \multicolumn{3}{c|}{\textbf{DeepAtropos}} & \multicolumn{3}{c}{\textbf{SynthSeg}} \\
        & Mean & Median & Std & Mean & Median & Std & Mean & Median & Std \\
        \hline
        Greedy & 0.7289 & 0.7393 & 0.0547 & 0.8717 & 0.8755 & 0.0219 & 0.8356 & 0.8384 & 0.0232 \\
        SyN    & 0.7090 & 0.7437 & 0.1178 & 0.8511 & 0.8735 & 0.0844 & 0.8300 & 0.8593 & 0.1183 \\
        VFA    & 0.5950 & 0.6421 & 0.1700 & 0.8764 & 0.8964 & 0.0507 & 0.8227 & 0.8575 & 0.0933 \\
        \hline
    \end{tabular}
    }
    % \caption{Registration method performance across different labelling protocols.}
    \caption{\small \textbf{Registration method performance across different labelling protocols on the NIMH T1w dataset.} Mean, median, and standard deviation of the Dice scores of the top three registration methods on the NIMH T1w dataset.}
    \label{tab:t1_nimh_table}
\end{table}


\textbf{Results}.
Interestingly, VFA performs significantly worse on the SLANT labelmaps (\autoref{tab:t1_nimh_table}), in direct contrast to the results in the LUMIR challenge.
Since the conditions for evaluation of the original challenge are unspecified, we can only speculate that the differences are due to preprocessing conditions and label interpolation schemes.  
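On the SynthSeg and DeepAtropos labelmaps, the performance of VFA is comparable to that of Greedy and SyN: VFA attains the highest mean and median Dice on DeepAtropos and a marginally lower mean on SynthSeg (\autoref{tab:t1_nimh_table}), with minor differences in the Cohen's $d$ scores (\autoref{fig:nimh_t1_labelmaps}).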
Permutation tests in \autoref{tab:nimh_perm} show that VFA significantly underperforms Greedy and SyN on the SLANT labelmaps, as indicated by high $\mu$ values, while the lower $\mu$ values for the DeepAtropos and SynthSeg labels suggest that the differences in Dice scores on these labelmaps are not statistically significant.
% 
This indicates that modern deep methods like VFA register coarse anatomical structures well but may struggle with highly parcellated structures.
However, deep methods are comparable to iterative methods on inter-subject registration of in-distribution contrasts. This reflects the maturity of deep learning methods in terms of task understanding for image registration, compared to the previous generation of methods, which performed well on the training data but failed to generalize under modest and practical amounts of domain shift~\citep{magicormirage,jian2024mamba,dio,bailiang,liu2025unsupervised}.