A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Published: 26 Jan 2026, Last Modified: 11 Feb 2026ICLR 2026 OralEveryoneRevisionsBibTeXCC BY 4.0
Keywords: image registration, distributed optimization, CUDA kernels, neuroanatomy
TL;DR: we propose non-GEMM CUDA kernels and distributed primitives to scale multimodal image registration to arbitrary image sizes
Abstract: In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100μm ex-vivo human brain MRI volume at native resolution – an inverse problem more than 570× larger than a standard clinical datum in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by upto 6 − 7× while reducing peak memory consumption by 20 − 59%. Comparative analysis on a 250μm dataset shows that FFDP can fit upto 64× larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 5251
Loading