\documentclass{midl} % Include author names
% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{multirow}
\usepackage{booktabs}
\usepackage{enumitem}

\usepackage{xcolor}
\usepackage{ulem}  % for \sout and \uwave

\renewcommand{\equationautorefname}{Equation}
\renewcommand{\figureautorefname}{Figure}
\renewcommand{\tableautorefname}{Table}
\renewcommand{\sectionautorefname}{Section}
\renewcommand{\subsectionautorefname}{Section}
\renewcommand{\subsubsectionautorefname}{Section}

\usepackage{mwe} % to get dummy images
\jmlrvolume{}
\jmlryear{2026}
\jmlrworkshop{} %{Full Paper -- MIDL 2026 submission}
\editors{} %{Under Review for MIDL 2026}

\title[The LUMirage]{The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 \midlauthor{\Name{Rohit Jena} \Email{rjena@seas.upenn.edu}\\
  \Name{Pratik Chaudhari} \Email{pratikac@seas.upenn.edu}\\
  \Name{James Gee} \Email{gee@upenn.edu}\\
  \addr University of Pennsylvania}

% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
% \midlauthor{\Name{Author Name1\midljointauthortext{Contributed equally}\nametag{$^{1,2}$}} \orcid{1111-2222-3333-4444} \Email{abc@sample.edu}\\
% \addr $^{1}$ Address 1 \\
% \addr $^{2}$ Address 2 \AND
% \Name{Author Name2\midlotherjointauthor\nametag{$^{1}$}} \Email{xyz@sample.edu}\\
% \Name{Author Name3\nametag{$^{2}$}} \Email{alphabeta@example.edu}\\
% \Name{Author Name4\midljointauthortext{Contributed equally}\nametag{$^{3}$}} \Email{uvw@foo.ac.uk}\\
% \addr $^{3}$ Address 3 \AND
% \Name{Author Name5\midlotherjointauthor\nametag{$^{4}$}} \Email{fgh@bar.com}\\
% \addr $^{4}$ Address 4
% }
\begin{document}

\maketitle

\begin{abstract}
The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data.
While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions---assertions that contradict established understanding of domain shift in deep learning.
In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias.
Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's $d$ effect sizes ranging from 0.7 to 1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6\,mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices.
These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny.
We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
\end{abstract}

\begin{keywords}
image registration, deep learning, instrumentation bias, neuroimaging, foundational models
\end{keywords}

\section{Introduction}

Quantitative analysis and integration of biomedical and biological data require images to reside in a common coordinate frame.
Deformable image registration (DIR) is the workhorse operation that achieves this alignment in medical image analysis, enabling joint analysis and fusion of data.
DIR is an inverse problem and shares the classical difficulties of most inverse problems in computer vision: ill-posedness, susceptibility to noise and artifacts, non-convexity, and a lack of well-defined ground truth solutions.
% Mathematically, given a fixed image $I_f$ and a moving image $I_m$ defined on a compact spatial domain $\Omega$, the objective is to find a transformation $\varphi: \Omega \rightarrow \Omega$ that maximizes the similarity between the moved image $I'_m = I\circ\varphi$ and the fixed image.
Most successful registration algorithms use a variational approach, finding the optimal transformation by iterative optimization subject to constraints and regularization on the transformation.
Although these methods are very robust across a wide range of modalities, anatomies, and species, early approaches were typically implemented on CPU, making them prohibitively slow for large-scale studies.
Recent methods address this limitation with fast GPU implementations~\cite{fireants,mang2024claire,convexadam} that employ advanced optimizers, often completing iterative optimization in under a second for clinical volumes, and that scale efficiently to large problems in bespoke life-science applications~\cite{wang2020allen,kronman2023developmental}.
%
%
Deep learning-based methods for image registration take a fundamentally different approach by posing the inverse problem as a statistical learning problem,
replacing hundreds of optimizer iterations with a single feedforward pass through tens to hundreds of layers that predicts a deformation field directly.
While deep learning methods can benefit substantially from sidestepping iterative optimization and from learning to explicitly register anatomical ROIs using auxiliary information such as labelmaps or landmarks, most deep methods generalize poorly to out-of-domain contrasts and resolutions.
This is a typical property of parametric models in statistical learning theory, which assumes that the training and test data are drawn from the same distribution~\cite{hastie2009elements,vapnik1998statistical}.
To mitigate the distribution shift issue, methods like SynthMorph~\cite{hoffmann2021synthmorph} propose training on a combinatorial space of synthetically generated volumes.
Other foundation models~\cite{tian2024unigradicon} propose training on a wide range of contrasts and anatomies to learn a general-purpose registration network.
Improving robustness of deep networks to work on a long tail of unseen modalities is an area of active research.
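
For concreteness, the two paradigms contrasted above can be sketched in notation as follows; this is a generic summary of the unsupervised case (auxiliary label or landmark terms are omitted) rather than the formulation of any specific method, and the symbols $\mathcal{D}$, $\mathcal{R}$, $\lambda$, $f_\theta$, and $p_{\mathrm{train}}$ are our own illustrative choices.
Given a fixed image $I_f$ and a moving image $I_m$ defined on a spatial domain $\Omega$, iterative optimization solves a separate problem for every image pair,
\[
  \varphi^{*} = \operatorname*{arg\,min}_{\varphi} \; \mathcal{D}\left(I_f,\, I_m \circ \varphi\right) + \lambda\, \mathcal{R}(\varphi),
\]
where $\mathcal{D}$ is a dissimilarity measure, $\mathcal{R}$ is a regularizer on the transformation $\varphi: \Omega \rightarrow \Omega$, and $\lambda \geq 0$ balances the two terms.
A learning-based method instead fits the parameters $\theta$ of a network $f_\theta$ once, over a training distribution $p_{\mathrm{train}}$ of image pairs, and predicts $\varphi = f_\theta(I_f, I_m)$ at test time:
\[
  \theta^{*} = \operatorname*{arg\,min}_{\theta} \; \mathbb{E}_{(I_f, I_m) \sim p_{\mathrm{train}}} \left[ \mathcal{D}\left(I_f,\, I_m \circ f_\theta(I_f, I_m)\right) + \lambda\, \mathcal{R}\left(f_\theta(I_f, I_m)\right) \right].
\]
Under this view, the first objective depends only on the pair being registered, whereas the second depends on $p_{\mathrm{train}}$, which is where sensitivity to distribution shift can enter.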

Despite the centrality of registration across many workflows in biomedical imaging, existing evaluation challenges have been limited in scope compared to those for other medical imaging tasks such as segmentation, and offer limited insight into the benefits and limitations of deep learning approaches to registration.
%existing evaluation challenges have offered valuable insights but with limited scope.
%studies performing a comprehensive evaluation of the benefits and limitations of both iterative optimization and deep learning based registration methods are relatively sparse.
The LUMIR challenge~\cite{beyondlumir} aims to address these limitations with a large-scale dataset and a platform for benchmarking the next generation of registration algorithms, with the goal of advancing clinical workflows and neuroscience research.
The challenge evaluation shows that modern deep learning methods can achieve competitive accuracy and high inference efficiency when trained on millions of T1w MRI image pairs without additional labeled supervision.
This is a highly encouraging result, showing the maturity of deep learning methods on large-scale neuroimaging datasets.
However, the paper makes further claims about the zero-shot performance of deep networks that run counter to the established understanding of how parametric statistical models behave:
\begin{itemize}[leftmargin=*]
    \item \textbf{Training deep learning models for registration on T1w MRI brain images alone performs exceptionally well on unseen resolutions and contrasts, even outperforming methods that are specifically trained to be domain-agnostic}: This claim is grounded neither in theory nor in practice.
    It is almost universally known that deep networks suffer on out-of-distribution data (i.e., deep learning methods are not good at extrapolation), which has led to numerous contributions in domain adaptation, transfer learning, self-supervised learning, and synthetically generated datasets and environments.
    It is possible that a special design in a registration network could disentangle the modality completely and lead to good domain-agnostic performance.
    But in that case, only the few methods with such explicit designs should perform well on out-of-distribution images.
    However, the paper claims that \textit{all deep methods} outperform iterative methods on all out-of-distribution modalities.
    We find this claim implausible without additional theoretical or empirical justification, and therefore test it independently in this paper.

    \item \textbf{Deep learning methods are unequivocally superior to iterative optimization methods on out-of-distribution T1w images}: This claim may be plausible since newer deep learning methods use registration-specific designs that are inspired by components used in iterative optimization methods.
    Specific design decisions may contribute to robust and accurate performance, but the no free lunch theorem suggests that a universally superior optimization technique on T1w images is unlikely.
\end{itemize}

In this paper, we perform a systematic and thorough re-evaluation of the claims about zero-shot performance made in the LUMIR challenge evaluation.
%
%In this paper, we perform a careful evaluation of the top performing registration method - Vector Field Attention, and the top performing variational method - FireANTs to indepdenently test the claims of zero-shot evaluation on various datasets spanning different modalities, resolutions, and species.
% We draw a different set of conclusions than in
Our conclusions are rather unsurprising, but strikingly different from those in \citet{beyondlumir}.
%
\textit{First}, we observe that the performance of SOTA deep learning methods on T1-weighted MRI is indeed comparable with that of iterative optimization methods, even on a human-adjacent species like the macaque, showing that the next generation of deep learning algorithms for registration demonstrates substantially better task understanding.
However, deep learning methods can show inferior performance when evaluated on highly parcellated label sets like the SLANT segmentation, compared to other segmentation labels like DeepAtropos or SynthSeg.
%Particularly, we
\textit{Second}, the task understanding does not translate to better performance on out-of-distribution contrasts, contrary to the results shown in \citet{beyondlumir}.
\textit{Third}, scalability remains an issue for deep learning methods, as demonstrated on the high-resolution Ultracortex dataset, while iterative optimization methods, owing to their low memory footprint, can register high-resolution brains directly and enjoy improved performance from doing so.
This scalability makes iterative optimization methods a more practical choice for the high-resolution image registration pertinent to histopathological workflows and life sciences research.
\textit{Fourth}, we show that deep learning methods are highly sensitive to even trivial changes in image preprocessing, including retaining padding from the original dataset.
These modes of sensitivity put a burden on the practitioner to ensure that the data conforms to the preprocessing conventions used during training, potentially limiting their applicability under the high variability typical of real clinical scenarios.

%We point a few potential flaws in the evaluation criteria.
%First, we note that the evaluation exclusively uses SLANT for segmentations across all brain datasets, with the exception of the PRIME-DE dataset.
%However,
%Furthermore, the evaluation uses SLANT to obtain segmentations for the Ultracortex dataset which contains 9.4T images that are substantially different than T1w MRI images both qualitatively and quantitatively.

\section{Evaluation Setup}

The LUMIR challenge shows zero-shot evaluation on a variety of datasets spanning different contrasts, two species, and three tasks (inter-subject, atlas-to-subject, and subject-to-atlas registration).
However, the labeled data generation and evaluation are not discussed in sufficient detail to ensure reproducibility.
There are also a few oversights in the dataset descriptions and evaluation criteria, which we discuss and whose effect we account for in our evaluation.
For each dataset, we also consider the primary sources of instrumentation bias that can affect evaluation, and how we control for these conditions.

\subsection{Primary Sources of Instrumentation Bias}

To ensure that evaluation is fair, we discuss common and data-specific sources of instrumentation bias that can affect evaluation.
Acknowledging instrumentation bias is important because the evaluation in challenge datasets may differ significantly from how a practitioner uses the methods in clinical and research workflows.
Our previous work~\cite{magicormirage} shows that evaluations of deep learning methods typically exhibit instrumentation bias that leads to misrepresentation of the true performance of optimization methods.
% For example, for high-resolution data like Ultracortex that have a 0.6mm resolution, a practitioner might want to register volumes at that resolution to align intricate details like the cortical boundary.
% For example, high-resolution \textit{ex vivo} MRI imaging provides richer information including finer anatomical detail, reduced partial-volume effects that can affect gray-matter white-matter segmetnation, better separation of cortical layers, improved visualization of hippocampal subfields.
% However, resampling the image to 1mm resolution to ensure uniform evaluation can degrade performance for methods that can indeed register volumes at high resolution.
% 
Primary sources of instrumentation bias include:
\begin{itemize}[leftmargin=*]
    \item \textbf{Running multimodal registration with unimodal similarity functions}: Iterative solvers will fail catastrophically when multimodal images are registered using unimodal losses.
    For example, the Ultracortex dataset contains a mix of MP-RAGE and MP2RAGE sequences for different subjects, which are qualitatively and quantitatively distinct in terms of contrast and resolution.
    Our evaluation for iterative optimization considers the effect of choosing different similarity functions for multimodal registration.
    % For deep networks, they may return a zero displacement field or highly OOD warps if untrained on multimodal registration.
    \item \textbf{Evaluating registration algorithms on low resolution images}: Most modern registration challenges downsample the data to a standard isotropic resolution so that training fits within the high memory requirements of deep networks.
    Two major benefits of iterative optimization methods are their very low memory footprint at inference, and that they perform \textit{better} at high resolutions.
    In almost every practical scenario, a practitioner would want the images to be registered at the highest resolution possible to obtain high-fidelity warp fields.
    Running optimization solvers on downsampled resolutions therefore constitutes \textit{intentional weakening} of the baseline.
    Instead, proponents of deep learning-based image registration (DLIR) should focus on improving the ability of deep learning to register large-scale images, rather than on weakening optimization solvers.
    \item \textbf{Labelmap bias due to non-existent intensity boundaries}: SLANT is used for labelling, which uses the BrainCOLOR protocol to obtain a comprehensive segmentation of the brain.
    However, cortical parcellation is performed by lifting the atlas to the subject coordinate frame; the cortical boundaries do not exist as intensity features in the in-vivo MRI.
    This leads to spurious results when the Dice score is computed~\citep{magicormirage}.
    For in-vivo intensity images, cortical and subcortical structures that \textit{can} be delineated must be included in evaluation.
\end{itemize}
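
For reference, the Dice score referred to in the last item is the standard overlap measure between corresponding label regions; the notation below ($S_f^{k}$, $S_m^{k} \circ \varphi$, and $K$) is ours, and the unweighted average over labels is an assumption made for illustration rather than the exact evaluation code of any challenge.
For label $k$, let $S_f^{k}$ be its region in the fixed labelmap and $S_m^{k} \circ \varphi$ its region in the warped moving labelmap. Then
\[
  \mathrm{Dice}_k(\varphi) = \frac{2\, \left| S_f^{k} \cap \left(S_m^{k} \circ \varphi\right) \right|}{\left| S_f^{k} \right| + \left| S_m^{k} \circ \varphi \right|},
  \qquad
  \overline{\mathrm{Dice}}(\varphi) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Dice}_k(\varphi).
\]
With such an unweighted average, labels whose boundaries have no intensity support in the image influence the score as much as labels whose boundaries are visible.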

The LUMIR challenge claims to perform evaluation on a variety of resolutions, but the text mentions that all datasets are resampled to the 1\,mm MNI space, effectively removing any effect of the datasets' varying resolutions on performance.
Moreover, we find the registration of the PRIME-DE dataset to the MNI template questionable; our evaluation yields a smaller discrepancy in performance between deep and iterative methods by improving the performance of the deep learning method.
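
As a rough illustration of what such resampling discards, consider a fixed field of view sampled at 0.6\,mm versus 1\,mm isotropic spacing (a back-of-the-envelope calculation under that assumption, not a measurement from our experiments):
\[
  \frac{N_{0.6}}{N_{1.0}} = \left(\frac{1\,\mathrm{mm}}{0.6\,\mathrm{mm}}\right)^{3} \approx 4.6,
\]
where $N_{s}$ denotes the number of voxels at spacing $s$.
Resampling to 1\,mm thus retains roughly one voxel for every 4.6 in the native grid, and any quantity that grows linearly with the number of voxels, such as the storage for a dense displacement field, grows by the same factor at native resolution.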

\subsection{Choice of Baselines}
\citet{beyondlumir} posit that the new generation of deep learning architectures surpasses optimization solvers on all registration tasks.
To carefully evaluate this claim, we set out to independently evaluate the top-performing methods ranked in Table 1 of the LUMIR challenge paper against FireANTs~\citep{fireants}, which is reported as the best-performing iterative solver.
However, at the time of writing, of the eight top-performing methods, only \textit{two} implementations are available in the public domain: SITReg \citep{sitreg} (Rank 1) and Vector Field Attention (Rank 4) \citep{vfa}.
% The lack of availability of the codebase for the top performing methods is a major limitation of the LUMIR challenge, and we hope that the codebase and pretrained models would be made publicly available for the community to reproduce the results.
Although SITReg provides an open-source implementation, it does not offer user-friendly interfaces for evaluation on arbitrary datasets, and despite our best efforts at modifying the codebase, we could not run the trained model in our evaluation setup.
VFA, on the other hand, provides highly customizable configurations that allowed us to run evaluations seamlessly with minimal changes to the original codebase.
FireANTs (Rank 12) provides both CLI-based and Python-based scripts for evaluation of arbitrary datasets, and we use the Python-based script for consistency.
Therefore, we use VFA as the primary deep learning method for comparison with FireANTs.

\input{sections/nimh_t1}
\input{sections/prime_de}
\input{sections/nimh_ood}
\input{sections/ultracortex}
\input{sections/preprocessing}
\input{sections/discussion}

\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}

\bibliography{ref}

\clearpage
\appendix
\input{sections/appendix}

\end{document}
