\section{Out-of-distribution contrasts on the NIMH T1w dataset}
While T1-weighted imaging provides excellent anatomical detail with superior gray-white matter contrast ideal for morphometric analysis and structural segmentation, it exhibits limited sensitivity to many pathological processes. 
T2-weighted and FLAIR sequences offer complementary contrast mechanisms that are essential for detecting white matter lesions, edema, inflammation, and demyelination, pathologies that often appear subtler on T1w scans.
For example, T2* imaging is sensitive to magnetic susceptibility effects, making it useful for detecting hemorrhages, iron buildup, and other magnetic substances that are not visible in T1w scans.
T2-weighted images are particularly useful for visualizing fluid-filled structures such as cerebrospinal fluid (CSF) spaces, as well as white matter lesions that appear isointense with the background on T1w scans.
FLAIR images, on the other hand, suppress the signal from CSF, highlighting abnormalities such as lesions, tumors, and stroke against a more homogeneous background.
In these modalities, the GM-WM boundary is often less distinct than in a T1w scan.
In clinical practice, multimodal protocols combining these contrasts are standard precisely because no single sequence provides comprehensive tissue characterization: T1w reveals anatomy while T2/FLAIR/T2* reveal pathophysiology.

The LUMIR challenge uses the NIMH dataset with T1w, T2w, T2*, and FLAIR sequences for zero-shot evaluation on out-of-distribution contrasts, and obtains labelmaps for the T2w, T2*, and FLAIR images by transferring the SLANT segmentation of the co-registered T1w images.
However, since the majority of labels in the SLANT segmentation are cortical parcellations and the T2w, T2*, and FLAIR images do not provide sufficient GM-WM contrast compared to T1w images, we argue that this parcellation is not representative of the registration task. 
Moreover, our experiments in \autoref{sec:nimh_t1} show that the performance of deep learning methods (VFA) on the SLANT labelmaps is significantly worse than that of iterative methods (Greedy and SyN).
Therefore, we consider SynthSeg for labelmap generation on the T2w, T2*, and FLAIR images.
SynthSeg is a general-purpose segmentation model trained on a wide range of contrasts, making it suitable for accurate segmentation of all three contrasts.
Moreover, VFA performs comparably to iterative methods on the SynthSeg labelmap on T1w images, setting a benchmark for performance comparison between in-distribution and out-of-distribution contrasts.
We segmented 438 images from each contrast, giving a total of 191,406 (= 438 $\times$ 437) ordered image pairs for evaluation of each contrast.
To reduce computational cost, we evaluate the average Dice score between the registered and reference labelmaps on a randomly chosen (and fixed) subset of 5,000 pairs for each contrast.
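As a point of reference, one common way to compute this average is sketched here; the unweighted mean over labels and the symbols $K$, $A_k$, and $B_k$ are notational assumptions rather than a statement of the exact evaluation code. For $K$ anatomical labels, let $A_k$ and $B_k$ denote the sets of voxels assigned label $k$ in the registered and reference labelmaps, respectively:
\[
    \mathrm{Dice} = \frac{1}{K} \sum_{k=1}^{K} \frac{2\,|A_k \cap B_k|}{|A_k| + |B_k|} .
\]
A value of 1 indicates perfect overlap for every label, and a value of 0 indicates no overlap.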

\textbf{Results}. Violin plots and pairwise Cohen's d scores are reported in \autoref{fig:nimh_ood_labelmaps} and statistical summaries are shown in \autoref{tab:ood_nimh_table}.
Compared to T1w images, the performance of the top deep learning method VFA drops significantly for unseen contrasts. 
The largest difference in performance is observed in T2w images, followed by T2* and FLAIR images.
The Cohen's d scores in \autoref{fig:nimh_ood_labelmaps} indicate that the performance difference has a substantial practical impact for all three contrasts, unlike the results on the T1w dataset, where the differences are minor and have little practical impact.
For example, on the T2w and T2* images, the Cohen's d scores are in the range of 0.70--1.51, well outside the accepted standard for ``small effects'' ($d < 0.2$).
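For reference, a standard pooled-variance form of Cohen's d for two methods $a$ and $b$ is given here. This particular variant is an illustrative assumption: a paired variant, which normalizes by the standard deviation of per-pair differences, is equally common.
\[
    d = \frac{\bar{x}_a - \bar{x}_b}{s_p}, \qquad
    s_p = \sqrt{\frac{(n_a - 1)\,s_a^2 + (n_b - 1)\,s_b^2}{n_a + n_b - 2}},
\]
where $\bar{x}$, $s$, and $n$ denote the mean, standard deviation, and number of Dice scores for each method. By Cohen's convention, $d \approx 0.2$, $0.5$, and $0.8$ correspond to small, medium, and large effects, respectively.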
Permutation tests (\autoref{tab:nimh_perm}) further confirm that these performance gaps are statistically significant. 
The consistently large mean test statistics ($\mu$) for T2 and T2* reinforce that the observed differences reflect systematic performance degradation rather than sampling variability.
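A generic two-sided permutation test of this kind can be sketched as follows, where the test statistic $T$ (for example, the difference in mean Dice between two methods), the number of permutations $B$, and the add-one correction are assumptions for illustration rather than details of the exact protocol:
\[
    p = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\left[\, |T_b| \ge |T_{\mathrm{obs}}| \,\right]}{B + 1},
\]
where $T_{\mathrm{obs}}$ is the statistic under the original method assignment and $T_b$ is its value after the $b$-th random reassignment of method labels. \autoref{tab:nimh_perm} summarizes the resulting p-values by the fraction that fall below 0.05.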
% 
% For T2 and T2* images, permutation tests in \autoref{tab:nimh_perm} show substantially high values of $\mu$ values indicating that the differences are also statistically significant in addition to the practical impact.
% Another notable observation is that the permutation tests and Cohen's d scores show low effect size and statistical significance for FLAIR images, which is consistent with the long tail of Dice scores in the violin plot in \autoref{fig:nimh_ood_labelmaps}.
% We hypothesize that this is because FLAIR images typically show higher contrast for T2 hyperintense pathology (WM lesions, edema, periventricular abnormalities), rather than emphasizing typical morphometric boundaries like GM-WM interface and subcortical structures.
In contrast, both Cohen's d and permutation test results indicate smaller effect sizes and weaker statistical significance for FLAIR. 
This aligns with the long-tailed Dice distribution observed in the violin plots (\autoref{fig:nimh_ood_labelmaps}), suggesting higher variability but less consistent separation between methods. 
We hypothesize that this behavior stems from the contrast properties of FLAIR imaging: FLAIR emphasizes T2-hyperintense pathology (e.g., white matter lesions, edema, periventricular abnormalities) while providing comparatively weak contrast at morphometric tissue boundaries such as the GM-WM interface and deep subcortical structures. 
The resulting boundary ambiguity likely increases registration variability of morphometric boundaries without producing a consistent directional performance gap between methods.

These results are in stark contrast to the results in the LUMIR challenge, where VFA performed \textit{better} than iterative methods on out-of-distribution contrasts, with seemingly no empirical consideration or theoretical justification for the observed difference.
% 
Domain shift is a well-studied phenomenon in deep learning that is pervasive across a broad range of application areas, and significant resources have been devoted to systematically studying and mitigating it~\citep{beery2020iwildcam,zech2018variable,albadawy2018deep,jadon2025realdrivesim}.
The LUMIR challenge asserts that a variety of deep learning architectures are robust to domain shift just by training on a large set of T1w images, a direct contradiction to the extensive literature on domain shift and domain adaptation in deep learning.
A crucial question therefore emerges: can this seemingly absurd conclusion from the original challenge be extended to other tasks such as lung CT or abdominal registration?
% Moreover, the LUMIR challenge analysis did not explicitly examine how registration performance may depend on image contrast, particularly comparing T1w and FLAIR, where regions of high contrast are qualitatively complementary. 
Moreover, the LUMIR challenge did not explicitly discuss or account for the interaction between registration performance and image contrast. 
In particular, T1w and FLAIR images emphasize qualitatively different structures: T1w highlights morphometric boundaries, whereas FLAIR emphasizes T2-hyperintense pathology. Because these contrasts are complementary rather than equivalent, registration difficulty and evaluation metrics may reflect modality-specific contrast properties rather than purely methodological differences.
%
Our results are consistent with the expectation that deep methods learn a distribution of \textit{T1w (task) specific} features that is then used in conjunction with registration-aware modules~\citep{bailiang,jian2024mamba} to achieve generalization to T1w images.
Moreover, VFA exhibits significantly higher variance in Dice scores (a proxy for predictive variance) than Greedy and SyN for all three contrasts, which is consistent with the literature on predictive uncertainty and entropy estimation of deep learning methods on out-of-distribution data~\citep{lakshminarayanan2017simple,maddox2019simple,malinin2018predictive}.

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.55\linewidth]{figures/nimh_ood.pdf} \hfill
    \includegraphics[width=0.44\linewidth]{figures/nihm_ood_cohens_d.pdf}
    \caption{\small Comparison of the three registration methods on out-of-distribution contrasts on the NIMH dataset with labels generated by SynthSeg.}
    \label{fig:nimh_ood_labelmaps}
    \vspace*{-20pt}
\end{figure}

\begin{table}[h]
    \centering
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{l|ccc|ccc|ccc}
    \hline
    \multirow{2}{*}{\textbf{Method}} & \multicolumn{3}{c|}{\textbf{T2}} & \multicolumn{3}{c|}{\textbf{T2*}} & \multicolumn{3}{c}{\textbf{FLAIR}} \\
    & \textbf{Mean} & \textbf{Median} & \textbf{Std} & \textbf{Mean} & \textbf{Median} & \textbf{Std} & \textbf{Mean} & \textbf{Median} & \textbf{Std} \\
    \hline
    Greedy & 0.7961 & 0.8038 & 0.0292 & 0.7012 & 0.7148 & 0.0608 & 0.7049 & 0.7222 & 0.0732 \\
    SyN    & 0.8203 & 0.8241 & 0.0213 & 0.7161 & 0.7292 & 0.0606 & 0.6662 & 0.7087 & 0.1245 \\
    VFA    & 0.5524 & 0.5865 & 0.1792 & 0.6072 & 0.6450 & 0.1287 & 0.6373 & 0.6907 & 0.1509 \\
    \hline
    \end{tabular}
    }
    \caption{\small Registration method performance across different out-of-distribution contrasts on the NIMH dataset with labels generated by SynthSeg.}
    \label{tab:ood_nimh_table}
\end{table}

\begin{table}
    \centering
    \begin{tabular}{lcccccc}
    \toprule
    \textbf{Method} & \textbf{SLANT} & \textbf{DeepAtropos} & \multicolumn{4}{c}{\textbf{SynthSeg}} \\
    \cmidrule(lr){1-7}
     & \textbf{T1} & \textbf{T1} & \textbf{T1} & \textbf{T2} & \textbf{T2*} & \textbf{FLAIR} \\
    \midrule
    Greedy vs. SyN & 0.1870 & 0.6383 & 0.0031 & 0.5930 & 0.4292 & 0.0443 \\
    Greedy vs. VFA & 1.0000 & 0.0222 & 0.1258 & 0.4244 & 0.2407 & 0.0280 \\
    SyN vs. VFA    & 0.9781 & 0.2143 & 0.0025 & 0.6035 & 0.4629 & 0.0292 \\
    \bottomrule
    \end{tabular}
    \caption{Statistical significance on the NIMH dataset, reported as the fraction of p-values below 0.05 for each method pair using permutation tests. Higher values indicate greater statistical significance.}
    \label{tab:nimh_perm}
\end{table}