\section{Inter-subject registration on the PRIME-DE Macaque dataset}
\label{sec:prime_de}

A natural extension of zero-shot evaluation beyond human T1w MRI is to evaluate registration performance on a human-adjacent mammalian species such as the macaque.
To that end, the PRIME-DE dataset provides a collection of T1w MRI images of the macaque brain, with original resolutions varying from 0.3 to 0.8\,mm.
This is in contrast to~\citet{beyondlumir}, which incorrectly claims that the brain images were originally acquired at 1\,mm isotropic resolution, indicating a potential lack of quality control in the original evaluation.
We download data from the five sites mentioned in the original challenge, and then perform brain extraction and segmentation using the nBEST tool~\citep{nbest}.
All subjects are affinely registered using FireANTs to a manually chosen subject with 0.3\,mm resolution, which maximizes the field of view and the sampling resolution available to the lower-resolution subjects.
We obtain 116 brain images, resulting in 13,340 (= 116 $\times$ 115) image pairs for evaluation.
We include preprocessing and affine alignment scripts in our provided code.
% 
The nBEST tool provides two segmentations: (1) a segmentation of three tissue labels (GM, WM, CSF), which we refer to as the cortical segmentation, and (2) a segmentation of six subcortical labels: thalamus, caudate, putamen, pallidum, hippocampus, and amygdala.

\textbf{Results}.
We include violin plots, summary statistics, and Cohen's d scores of Dice score overlap between inter-subject registered labelmaps for both the cortical and subcortical segmentations in \autoref{fig:primede_labelmaps} and \autoref{fig:primede_cohens_d}.
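For reference, we assume the standard definition of the Dice score for a single label; the symbols $A$ and $B$ are introduced here for illustration only and denote the sets of voxels carrying that label in the warped moving labelmap and in the fixed labelmap, respectively:
\begin{equation}
    \mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}.
\end{equation}
A score of 1 indicates perfect overlap and a score of 0 indicates no overlap.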
% 
We note a small gap between the performance of Greedy and VFA for both cortical and subcortical segmentations, indicating that modern deep learning methods generalize well to a familiar modality with unseen anatomy (i.e., T1w MRI of the macaque cerebrum).
The Cohen's $d$ scores in \autoref{fig:primede_cohens_d} show that, although small, the performance difference between Greedy and VFA is of practical significance for both cortical and subcortical segmentations, with $d$ scores of 0.308 and 1.016, respectively, both well above the commonly used cutoff for ``small effects'' ($d < 0.2$).
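As a point of reference, one common paired form of Cohen's $d$ is sketched below. It is an assumption made for illustration, and the symbols $\delta_i$, $\bar{\delta}$, $s_\delta$, and $N$ are not defined elsewhere in this section. With $\delta_i$ denoting the difference in Dice score between two methods on image pair $i$,
\begin{equation}
    d = \frac{\bar{\delta}}{s_\delta}, \qquad
    \bar{\delta} = \frac{1}{N} \sum_{i=1}^{N} \delta_i, \qquad
    s_\delta = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(\delta_i - \bar{\delta}\right)^2}.
\end{equation}
The other widely used variant replaces $s_\delta$ with the pooled standard deviation of the two methods' scores. In the paired form, a small mean difference can still yield a large $d$ when the per-pair differences are consistent in sign and magnitude.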
% 
Permutation tests in \autoref{tab:primede_perm} also indicate that the difference in Dice scores between Greedy and VFA \textit{is} statistically significant on independent subsets of image pairs: on all subsets for the cortical segmentations (fraction 1.0), and on a smaller share of subsets (0.6873) for the subcortical segmentations.
However, SyN underperforms VFA \textit{significantly} for the subcortical segmentations, indicating that Greedy is the better overall choice for iterative registration.
% 
This modest performance gap contrasts with the results of~\citet{beyondlumir}, where VFA underperformed FireANTsGreedy substantially, which could be attributed to poorly designed preprocessing conditions in the original evaluation.
This underscores the importance of careful design of preprocessing conditions for zero-shot evaluation, and the need for a standardized evaluation protocol for inter-subject registration.
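The fractions reported in \autoref{tab:primede_perm} can be written compactly as follows. This is a sketch under stated assumptions: $K$ (the number of independent subsets), $B$ (the number of permutations per subset), and the test statistic $T$ are hypothetical names, and the two-sided form with the add-one correction is one common choice rather than a description of the exact test used.
\begin{equation}
    p_k = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\left[\, |T_b^{(k)}| \geq |T_{\mathrm{obs}}^{(k)}| \,\right]}{B + 1}, \qquad
    f = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\left[\, p_k < 0.05 \,\right].
\end{equation}
Here $T_{\mathrm{obs}}^{(k)}$ is the statistic observed on subset $k$ (for example, the mean difference in Dice score between two methods), $T_b^{(k)}$ is its value under the $b$-th permutation, and $f$ is the fraction shown in the table.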

\begin{figure}
    \centering
    \includegraphics[width=0.48\linewidth]{figures/primede_tissue.pdf}
    \includegraphics[width=0.48\linewidth]{figures/primede_subcortical.pdf}
    \caption{\small \textbf{Comparison of the three registration methods on the PRIME-DE dataset.} \textbf{Left} shows violin plots of the Dice scores of tissue overlap (GM, WM, CSF); \textbf{Right} shows violin plots of the Dice scores of subcortical overlap between the registered and reference labelmaps.}
    \label{fig:primede_labelmaps}
\end{figure}

\begin{figure}
    \centering
    \begin{minipage}{0.54\linewidth}
        \resizebox{\linewidth}{!}{%
        \begin{tabular}{l*{2}{c}}
            \toprule
            Method & Tissue & Subcortical \\
            \midrule
            Greedy & $0.829 \pm 0.030$ & $0.735 \pm 0.158$ \\
            SyN & $0.823 \pm 0.032$ & $0.708 \pm 0.151$ \\
            VFA & $0.822 \pm 0.026$ & $0.729 \pm 0.165$ \\
            \bottomrule
        \end{tabular}
        }
    \end{minipage}
    \begin{minipage}{0.45\linewidth}
        \includegraphics[width=\linewidth]{figures/primede_cohens_d.pdf}
    \end{minipage}
    \caption{\small \textbf{Quantitative comparison of the three registration methods on the PRIME-DE dataset.} \textbf{Left} shows the mean $\pm$ standard deviation of the Dice scores of the three registration methods for the tissue and subcortical segmentations. \textbf{Right} shows Cohen's $d$ scores for all method pairs.}
    \label{fig:primede_cohens_d}
\end{figure}

\begin{table}
    \centering
    \begin{tabular}{lcc}
    \toprule
    \textbf{Method pair} & \textbf{Cortical} & \textbf{Subcortical} \\
    \midrule
    Greedy vs. SyN & 1.0 & 1.0 \\
    Greedy vs. VFA & 1.0 & 0.6873 \\
    SyN vs. VFA    & 0.0512 & 0.9990 \\
    \bottomrule
    \end{tabular}
    \caption{Statistical significance on the PRIME-DE dataset, reported as the fraction of permutation-test p-values below 0.05 for each method pair. Higher values indicate that the difference is significant on a larger share of the tests.}
    \label{tab:primede_perm}
\end{table}