\documentclass{midl} % Include author names
\usepackage{mwe}
\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\jmlrvolume{-- nnn}
\editors{Accepted for publication at MIDL 2024}


\newcommand{\N}[0]{\mathcal{N}}
\renewcommand{\P}[0]{\mathcal{P}}
\newcommand{\Ad}[0]{{\rm Ad}}
\newcommand{\ad}[0]{{\rm ad}}
\newcommand{\Diff}[0]{{\rm Diff}}
\renewcommand{\div}[0]{{\rm div}}
\newcommand{\Dist}[0]{{\rm Dist}}
\newcommand{\Reg}[0]{{\rm Reg}}
\newcommand{\Inn}[0]{{\rm Inn}}
\newcommand{\id}[0]{{\rm id}}
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\newcommand{\sym}[0]{{\rm sym}}
\newcommand{\Exp}[0]{{\rm Exp}}
% \newcommand{\S}[0]{{\mathbf{\hat{S}}}}
% \usepackage[boxed, ruled, vlined, linesnumbered]{algorithm2e}
% \usepackage{mathrsfs}
% \usepackage{amsmath,amssymb}

\title[Motion Estimation and Geometric Correction in Fetal Images]{Joint Motion Estimation with Geometric Deformation Correction for Fetal Echo Planar Images Via Deep Learning}

\midlauthor{\Name{Jian Wang\midljointauthortext{Contributed equally}\nametag{$^{1}$}} \Email{jian.wang@childrens.harvard.edu}\\
\Name{Razieh Faghihpirayesh\midlotherjointauthor\nametag{$^{1,2}$}} \Email{raziehfaghih@ece.neu.edu}\\
\Name{Deniz Erdo{\u{g}}mu\c{s}\nametag{$^{2}$}} \Email{d.erdogmus@northeastern.edu}\\
\Name{Ali Gholipour\nametag{$^{1}$}} \Email{ali.gholipour@childrens.harvard.edu}\\ \\
\addr $^{1}$ Boston Children's Hospital and Harvard Medical School, Boston, MA\\
\addr $^{2}$ Northeastern University, Boston, MA \AND
}

\begin{document}

\maketitle

\begin{abstract}
In this paper, we introduce a novel end-to-end predictive model for efficient fetal motion correction using deep neural networks. Unlike conventional methods that estimate fetal brain motion and geometric distortions separately, our approach uses a joint learning framework that not only reliably estimates various degrees of rigid movement, but also effectively corrects local geometric distortions of fetal brain images. Specifically, we first develop a method to learn rigid motion through a closed-form update integrated into network training. Subsequently, we incorporate a diffeomorphic deformation estimation model to guide the motion correction network, particularly in regions where local distortions and deformations occur. To the best of our knowledge, our study is the first to simultaneously track fetal motion and address geometric deformations in fetal echo-planar images. We validated our model using real fetal functional magnetic resonance imaging data with simulated and real motions. Our method demonstrates significant practical value for measuring, tracking, and correcting fetal motion in fetal MRI.
\end{abstract}

\begin{keywords}
Deep Learning, Fetal MRI, Rigid Motion Correction, Geometric Deformation. 
\end{keywords}

\section{Introduction}
Motion estimation is a critical step in correcting image artifacts induced by object motion, particularly in medical image analysis for fetal imaging.
It plays a central role in a range of tasks, including image segmentation~\cite{ebner2020automated, faghihpirayesh2022deep}, reconstruction~\cite{gholipour2010robust, kuklisova2012reconstruction, xu2023nesvor}, and pose estimation~\cite{salehi2018real,golland20203d}, among others. Effective motion compensation and correction techniques significantly contribute to the overall accuracy and efficiency of fetal imaging studies~\cite{malamateniou2013motion}. Estimating fetal motion presents inherent challenges, notably the unpredictable nature of fetal head movements and local geometric distortions due to accumulated imaging errors. To overcome these challenges, existing motion correction methods mostly fall into two distinct categories: i) iterative optimization methods based on mathematical models, and ii) predictive models learned by deep neural networks.
% \paragraph{Related Works.}

One of the initial paradigms of 3D fetal brain reconstruction using motion correction was a three-step model~\cite{rousseau2006registration} involving multi-resolution slice alignment for motion correction, intensity non-uniformity correction, and super-resolution reconstruction through scattered data interpolation. To address motion errors from 2D slice misalignments, a slice-to-volume registration (SVR) model was introduced~\cite{jiang2007mri}, incorporating a scattered data interpolation method with a novel multi-level B-spline kernel. A motion correction technique was presented in~\cite{kim2009intersection} to align image stacks based on 2D slice intersections to facilitate the reconstruction of a high isotropic resolution 3D volume. A forward model of slice acquisition and its inverse problem solution were proposed in~\cite{gholipour2010robust}. This approach featured a robust M-estimation solution to reduce the effects of corrupted (outlier) slices in a super-resolution reconstruction framework. Building upon these developments, an SVR approach was presented to encompass complete outlier removal through robust statistics based on the expectation-maximization algorithm~\cite{kuklisova2012reconstruction}. Tourbier et al.~\cite{tourbier2015efficient} used total variation regularized SVR, solved using the primal-dual hybrid gradient method. While these advancements have enhanced the accuracy and efficiency of motion correction in image reconstruction, all gradient-based optimization methods for slice realignments have a very limited capture range, often failing in cases of large and rapid fetal movements.

Motivated by the progress in deep learning, a spectrum of models has emerged, with a particular focus on advancing predictive motion estimation~\cite{mahendran20173d}. Hou et al. developed a method for predicting 3D rigid transformations of arbitrarily oriented 2D slices~\cite{hou20183}. This pursuit expanded with the introduction of a real-time fetal motion tracking system, leveraging the strengths of diverse neural networks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to predict motion directly from input images~\cite{salehi2018real,singh2020deep,evan2022keymorph}. 

To address non-rigid deformations in fetal MRI, two distinct approaches were presented. These include the patch-to-volume reconstruction technique~\cite{alansary2017pvr} and the deformable SVR technique~\cite{uus2020deformable}, both designed to correct non-rigid motion to reconstruct deformable anatomy such as fetal body parts. In the domain of automated fetal MRI reconstruction, Ebner et al. developed a toolkit that featured slice-level outlier rejection, fetal brain localization, and volume reconstruction~\cite{ebner2020automated}. 

Noteworthy progress has been achieved in the domain of sensorless imaging, particularly in 3D volume reconstruction from 2D freehand ultrasound images, leveraging deep implicit representations~\cite{yeung2021implicitvol}. Xu et al. incorporated transformers~\cite{vaswani2017attention} trained on synthetically transformed data, streamlining automatic relevance detection between slices~\cite{xu2022svort}. Advancing this approach, Xu et al. further introduced a reconstruction technique that incorporates implicit neural representations, enhancing image reconstruction performance~\cite{xu2023nesvor}.
Yu et al. introduced KeyMorph, an unsupervised deep learning framework designed for robust and interpretable multi-modal medical image registration~\cite{evan2022keymorph}. By using anatomically consistent keypoints, KeyMorph aims to improve the registration process and increase alignment accuracy.
Moyer et al. introduced a real-time method for tracking rigid motion by employing equivariant neural networks~\cite{moyer2021equivariant}. The effectiveness of this approach in capturing significant rigid motions is derived from the intrinsic rotation-equivariant nature of equivariant filters~\cite{cohen2016group}.

Despite the notable performance achieved by the aforementioned methods, there is currently no end-to-end predictive motion correction framework that can adequately address both uncontrollable fetal motion and geometric distortions. This gap in the existing literature motivated us to develop a comprehensive deep learning framework that efficiently handles both fetal motion and geometric distortions. The core contributions of our proposed motion correction method are threefold:
\begin{itemize}
\item \textbf{Pioneering contribution}: Our study introduces the first comprehensive end-to-end predictive motion tracking framework adept at handling both significant rigid motion and geometric distortions in fetal echo planar imaging (EPI).  
\item \textbf{Significance}: Our proposed framework not only achieves motion correction accuracy comparable to current state-of-the-art methods but also ensures faster and more stable convergence. This attribute is of significant value, especially in real-time, automated fetal head steering systems, where frequent and substantial motions occur.
\item \textbf{Theoretical advancement for broad applicability}: The theoretical tool developed through our joint learning algorithm demonstrates wide applicability, offering advantages in various fetal imaging applications. This includes real-time fetal brain segmentation, facilitating population-based studies of fetal brain development using mean template estimation, and improving EPI image reconstruction with advanced distortion correction techniques.
\end{itemize}

\section{Methodology}
In this section, we first outline the theoretical basis needed to address rigid motions and deformations in fetal imaging. 
We start by reviewing the principles of rigid motion estimation. Next, we focus on correcting geometric distortions, specifically using Large Deformation Diffeomorphic Metric Mapping (LDDMM), chosen for its effectiveness in creating smooth, invertible mappings that preserve the topology of the image~\cite{beg2005computing}. Finally, we describe the design of our proposed network architecture and the approach to its training.


\subsection{Rigid Motion Estimation}
Rigid motion estimation aims to identify the best translation $\mathcal{T}$ and rotation $\mathcal{R}$ parameters that define a rigid transformation $Q(\mathcal{T},\mathcal{R})$ between a pair of images. This process involves minimizing the Euclidean distance $d$ between the source image $S$ and the target image $T$,
\begin{equation}
\label{eq:imRigid} 
Q^{\ast}(\mathcal{T},\mathcal{R}) = \text{arg} \min_Q \ d [S \circ Q (\mathcal{T},\mathcal{R}), T] ,
\end{equation} 
where $\circ$ is the composition operator that resamples $S$ using the rigid transformation $Q(\mathcal{T},\mathcal{R})$. When this transformation is applied to any vector $\textbf{v}$, it yields a transformed vector of the form $Q(\textbf{v}) = \mathcal{R} \textbf{v} + \mathcal{T}$. Here, $\mathcal{R}^T = \mathcal{R}^{-1}$, indicating that $\mathcal{R}$ is an orthogonal matrix.

The transformation function $Q$ can be calculated using low-dimensional representations, such as point clouds or key points of images~\cite{moyer2021equivariant,evan2022keymorph}. This process is quantified by the energy function $\E (Q)$, as shown in the equation,
\begin{equation}
\label{eq:RigidEnergy} 
\E (Q) = \Vert S \circ Q (\mathcal{T},\mathcal{R}) - T \Vert _ {F}, 
\end{equation} 
where $\Vert\cdot\Vert_F$ denotes the Frobenius norm. Let $\bar{S}$ and $\bar{T}$ denote the low-dimensional representations of the source and target, respectively. We derive the closed-form solution for both translation and rotation parameters,
\begin{equation}
\label{eq:closeform} 
\mathcal{T} = \bar{T} - \mathcal{R}\bar{S},\quad \mathcal {R} = V \cdot {U}^T,\quad \text{s.t.} \, \, \det (\mathcal{R}) = 1,
\end{equation} 
where $ {U} \Sigma V^* = \bar{S} \cdot \bar{T}^{T}$ is a singular value decomposition, $U$ and $V^*$ are real orthogonal matrices, and $\Sigma$ is a diagonal matrix with non-negative real numbers on the diagonal. Constraining the determinant of $\mathcal{R}$ to $1$ excludes reflections, so that $\mathcal{R}$ is a proper rotation and $Q$ is a valid rigid transformation.
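
The rotation in Eq.~\eqref{eq:closeform} can be motivated by the classical orthogonal Procrustes (Kabsch) argument. The following is only a sketch under illustrative assumptions that are not stated in the text above: $\tilde{S}, \tilde{T} \in \mathbb{R}^{3 \times n}$ are hypothetical notation for $n$ corresponding key points stored as columns and centered to zero mean, with $\tilde{S} \tilde{T}^{T} = U \Sigma V^{T}$. Since $\mathcal{R}$ is orthogonal,
\begin{align*}
\Vert \mathcal{R} \tilde{S} - \tilde{T} \Vert_F^2 &= \Vert \tilde{S} \Vert_F^2 + \Vert \tilde{T} \Vert_F^2 - 2 \operatorname{tr} \big( \mathcal{R} \tilde{S} \tilde{T}^{T} \big), \\
\operatorname{tr} \big( \mathcal{R} \tilde{S} \tilde{T}^{T} \big) &= \operatorname{tr} \big( \mathcal{R} U \Sigma V^{T} \big) = \operatorname{tr} \big( \Sigma V^{T} \mathcal{R} U \big) \leq \operatorname{tr} (\Sigma).
\end{align*}
The bound holds because $V^{T} \mathcal{R} U$ is orthogonal, so its diagonal entries do not exceed $1$, and it is attained when $V^{T} \mathcal{R} U$ is the identity, that is, $\mathcal{R} = V U^{T}$. If $\det(V U^{T}) = -1$, a standard remedy is to use $\mathcal{R} = V \operatorname{diag}(1, 1, -1) U^{T}$ instead, which flips the direction associated with the smallest singular value and restores $\det(\mathcal{R}) = 1$.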

\subsection{Geometric Deformation Correction via LDDMM}
In this section, we provide an overview of the LDDMM algorithm for image registration~\cite{beg2005computing}. This algorithm is employed to address deformable geometric distortions between the rigid motion-corrected image $S \circ Q (\mathcal{T},\mathcal{R})$ and the target image $T$.
For simplicity, we denote $S \circ  Q (\mathcal{T},\mathcal{R})$ by $\mathbf{\hat{S}}$.

Given both $\mathbf{\hat{S}}$ and $T$, defined on a $d$-dimensional torus domain $\Gamma = \mathbb{R}^d / \mathbb{Z}^d$ ($\mathbf{\hat{S}} (x), T(x) : \Gamma \rightarrow \mathbb{R}$), the objective of diffeomorphic image registration is to find the shortest path of time-varying diffeomorphisms $\{\psi_t(x) : t \in [0,1]\}$ such that $\mathbf{\hat{S}} \circ \psi_1$ is similar to $T$. This is typically solved by minimizing an explicit energy function over the initial velocity field $v_0$~\cite{vialard2012}, which can be expressed as follows,
\begin{equation}
\label{eq:imReg} 
\E(v_0) = \frac{1}{\sigma^2}\Vert \mathbf{\hat{S}} \circ \psi_1(v_0) - T \Vert ^{2}_{2} + (L v_0, v_0),
\end{equation} 
where $\sigma^2$ is the noise variance of the images and $L: V\rightarrow V^{*}$ is a symmetric, positive-definite differential operator that maps a tangent vector $ v_t\in V$ into its dual space as
a momentum vector $m_t \in V^*$. This is typically denoted as
$m_t = L v_t$ or $v_t = K m_t$, with $K$ being the inverse operator of $L$. The notation $(\cdot, \cdot)$ denotes the pairing of a momentum vector with a tangent vector, which is similar to an inner product.
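
As a hedged illustration of the regularization term in Eq.~\eqref{eq:imReg}, the pairing can be written as an integral over the domain. The Laplacian-based form of $L$ shown below is a common choice in the LDDMM literature and is an assumption here, as are the smoothness weight $\alpha > 0$ and the positive integer power $c$:
\begin{equation*}
(L v_0, v_0) = \int_{\Gamma} \langle L v_0(x), v_0(x) \rangle \, dx, \qquad L = (-\alpha \Delta + I)^{c},
\end{equation*}
where $\Delta$ is the Laplacian and $I$ is the identity operator. Under this assumed form, a larger $\alpha$ penalizes spatially varying velocity fields more strongly and therefore favors smoother deformations.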

The geodesic shooting process states that the geodesic path $\{\psi_t\}$ can be uniquely determined by integrating a given initial velocity $v_0$ forward in time by using the Euler-Poincar\'e differential (EPDiff) equation~\cite{arnold1966,miller2006geodesic} as:
\begin{equation}\label{eq:distance}
        \frac{\partial v_t}{\partial t} = - K \left[(D v_t)^T \cdot m_t + D m_t \cdot v_t + m_t \cdot \operatorname{div} v_t \right] ,   \quad \frac{d\psi_t}{dt} = - D\psi_t\cdot v_t, 
\end{equation}
where the operator $D$ denotes a Jacobian matrix and $\cdot$ represents element-wise matrix multiplication. Here, $\operatorname{div}$ is the divergence. 
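
As a minimal sketch of how Eq.~\eqref{eq:distance} can be integrated numerically, assume a forward Euler scheme with $N$ uniform steps of size $\Delta t = 1/N$, starting from $v^{(0)} = v_0$ and $\psi^{(0)} = \id$. The superscript index $k$, the step size $\Delta t$, and the choice of an explicit scheme are illustrative assumptions rather than a description of the exact implementation:
\begin{align*}
m^{(k)} &= L v^{(k)}, \\
v^{(k+1)} &= v^{(k)} - \Delta t \, K \left[ (D v^{(k)})^T \cdot m^{(k)} + D m^{(k)} \cdot v^{(k)} + m^{(k)} \cdot \operatorname{div} v^{(k)} \right], \\
\psi^{(k+1)} &= \psi^{(k)} - \Delta t \, D \psi^{(k)} \cdot v^{(k)}, \qquad k = 0, \dots, N-1.
\end{align*}
After $N$ steps, $\psi^{(N)}$ approximates the endpoint $\psi_1$ that enters the image matching term of Eq.~\eqref{eq:imReg}.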

% \subsection{Objective Function}
% We write the total objective of motion and geometric deformation estimation by combining Eq.~\eqref{eq:RigidEnergy} and Eq.~\eqref{eq:imReg},  
% \begin{align}
% \label{eq:total} 
% \E(Q, v_0) =  \lambda \Vert \mathbf{\hat{S}} - T \Vert _ {F} +  \frac{1}{\sigma^2} \Vert \mathbf{\hat{S}} \circ \psi_1(v_0) - T \Vert ^{2}_{2} + (L v_0, v_0) 
% \quad & \text{s.t.} \, \, \text{Eq.~\eqref{eq:closeform}}, \ \& \ \text{Eq.~\eqref{eq:distance}}.
% \end{align}
% $\lambda$ denotes a weighting parameter balancing the effects of the rigid motion and geometric distortion correction. We could also adopt normalized cross correlation~\cite{avants2008symmetric} and mutual
% information~\cite{wells1996multi,wang2023metamorph} as image distance terms . 

%Existing methods oftentimes estimate rigid and geometric transformations in separate tasks that involve iterative optimization methods that are not robust and efficient ~\cite{alansary2017pvr,uus2020deformable}.
%, restricting the clinical applicability of such models in real-time interventions with time constraints, such as image-guided fetal head steering navigation system~\cite{faghihpirayesh2022deep}. This limitation motivated us to propose an predictive comprehensive motion estimation framework by deep neural networks to efficiently correct both large rigid motions and geometric distortions.

\subsection{Network Architecture and Training}
%Neural Network Training for Joint Motion and Deformation Correction}
We develop a deep learning model that explicitly estimates both rigid motion and geometric distortions by integrating a combination of Eq.~\eqref{eq:RigidEnergy} and Eq.~\eqref{eq:imReg} into our objective function. We show that such joint estimation results in improved accuracy and robustness of the model. Our framework consists of two modules: an enhanced rigid motion correction neural network that corrects rigid motions with a closed-form update, and an unsupervised geometric distortion correction module based on an image registration network.
An illustration of our joint learning framework is shown in Fig.~\ref{fig:network}. Below, we provide a detailed description of our network architecture and objective function.
\begin{figure}[!bt]
\begin{center}
 \includegraphics[width=.90\textwidth] {figures/Netarch.pdf}
     \caption{An illustration of our proposed \textbf{joint correction network (JCN)} architecture. The top module performs rigid motion correction on pairs of images using closed-form updates derived from their key point representations. The bottom module corrects geometric distortions in the aligned and target images, employing a geometric loss function with a regularization term in diffeomorphic transformation space. Both models are interrelated: the geometrically corrected data is fed back as augmented data to enhance the accuracy of rigid motion correction. This, in turn, aligns better with the geometric distortion network, facilitating the correction of local distortions in corresponding positions of the fetal brain.}
\label{fig:network}
\end{center}             
\end{figure}
\paragraph*{\bf Enhanced motion correction network.} 
Let $\Theta = ({\mathcal{T},\mathcal{R}})$ represent the encoder parameters that learn rigid parameters and key features from image spaces, with $Q(\Theta)$ denoting the transformation function yielded by the learned rigid parameters from the key points of images. The rigid correction loss is computed between the aligned and target images. $\Theta$ is characterized using one of two backbone structures: (i) a 7-layer equivariant neural network that captures rotation-equivariant representations using 3D steerable CNNs~\cite{weiler20183d,moyer2021equivariant}, and (ii) a conventional 9-layer CNN as employed in KeyMorph~\cite{evan2022keymorph}.

\paragraph*{\bf Geometric distortion correction based on image registration network.} 
Let $\Phi$ denote the parameters of an encoder-decoder in our geometric learning network, where $\psi(\Phi)$ represents the deformation fields and $v_0(\Phi)$ represents the velocity fields learned by the network. In addition to estimating deformations by LDDMM~\cite{hinkle2018lagomorph}, we also provide a variant that learns stationary velocity fields~\cite{balakrishnan2019voxelmorph} while maintaining comparable model accuracy. Advanced predictive image registration models, including TransMorph~\cite{chen2022transmorph} and DiffuseMorph~\cite{kim2022diffusemorph}, can easily be integrated into our framework.

\paragraph*{\bf Joint correction network and objective function.}
Our joint correction network (JCN) comes in two versions: JCN: EF, which employs equivariant filters, and JCN: KM, based on the CNN architecture that is similar to KeyMorph.
Our total objective function, which serves as the network's training loss, combines both the motion correction component (Eq.~\eqref{eq:RigidEnergy}) and the geometric deformation estimation (Eq.~\eqref{eq:imReg}), formulated as follows,
\begin{align}
\label{eq:total_net} 
l(\Theta, \Phi) =& \underbrace {\lambda \Vert S \circ {Q(\Theta)}  - T \Vert _ {F}}_{l_\text{rigid}} +  \underbrace {\frac{1}{\sigma^2} \Vert S \circ Q(\Theta) \circ \psi(\Phi)  - T \Vert ^{2}_{2} + (L v_0(\Phi), v_0(\Phi))}_{l_\text{geo}} \nonumber  \\ 
\, \, & + \text{reg} (\Theta, \Phi), \quad \quad \text{s.t.} \, \, \text{Eq.~\eqref{eq:closeform}}, \ \& \ \text{Eq.~\eqref{eq:distance}} .
\end{align}
% \begin{align}
% \label{eq:total_net} 
% l(\Theta, \Phi) =& \lambda \Vert S \circ {Q(\Theta)}  - T \Vert _ {F} +  \frac{1}{\sigma^2} \Vert S \circ Q(\Theta) \circ \psi(\Phi)  - T \Vert ^{2}_{2} + (L v_0(\Phi), v_0(\Phi)) \nonumber  \\ 
% \, \, & + \text{reg} (\rm{{\Theta}, \Phi}), \ \ \text{s.t.} \, \, \text{Eq.~\eqref{eq:closeform}}, \ \& \ \text{Eq.~\eqref{eq:distance}} .
% \end{align}
Here, $\text{reg}(\cdot)$ is a regularization term on the network parameters, and $\lambda$ is a weighting factor balancing the contributions of the two networks. Other image dissimilarity terms, such as normalized cross correlation~\cite{avants2008symmetric} and mutual information~\cite{wells1996multi,wang2023metamorph}, can also be adopted in our framework.
% \begin{algorithm2e}[!h]
% \SetAlgoLined
% \SetArgSty{textnormal}
% \SetKwInOut{Input}{Input}
% \SetKwInOut{Output}{Output}
% \DontPrintSemicolon
% \Input{A pairwise of image source $S$ and target $T$ }
% \Output{Predicted rigid parameters $\mathcal{R}$, $\mathcal{T}$ and distortion-corrected image with transformation fields $v_0$.}
%  \For{$i=1$ to $r$} {
%  \tcc{Train enhanced rigid motion correction network}
% Minimize the energy of rigid motion by Eq.~\eqref{eq:netlossG} with close-form updates;
% Generate motion-corrected images using produced motion parameters; 
% \tcc{Train geometric distortion correction network}
% Minimizing the geometric distortion correction network by Eq.~\eqref{eq:netlossC};
% Output the distortion-corrected image via applying the deformation fields onto it;
% }
% \textbf{Until} convergence
% % Until convergence
% \caption{Joint learning of enhanced motion correction.} \label{alg1}
% \end{algorithm2e}

\section{Experimental Setup}
% \subsection{Data}
\paragraph*{\bf Data.}
Our study includes $1{,}881$ pairs of 3D EPIs from fMRI time series of $15$ subjects who underwent fetal MRI scans using a Siemens 3T scanner (Skyra or Prisma) between August 2015 and September 2021. The study was approved by the institutional review board, and written informed consent was obtained from all participants. The dataset covers a gestational age range from $22.57$ to $38.14$ weeks (mean $32.39$ weeks). Imaging parameters included a slice thickness of $2$ to $3$\,mm, TR of $2$ to $5.6$ seconds (mean $3.1$ seconds), TE of $0.03$ to $0.08$ seconds (mean $0.04$ seconds), and a flip angle (FA) of $90$ degrees. Fetal brains were extracted using a real-time deep learning segmentation method~\cite{faghihpirayesh2023fetal}, resampled to $96^3$ with a voxel resolution of $3$\,mm$^3$, and underwent intensity normalization.
\paragraph*{\bf Baselines \& Evaluation Metrics.}
We compared various motion correction methods, including conventional iterative rigid registration in ITK, which estimates rigid transformations as 3D versors~\cite{ibanez2003itk}, and deep learning methods for rigid motion estimation such as DeepPose~\cite{salehi2018real} and KeyMorph~\cite{evan2022keymorph}. We compared their performance with our joint models JCN: KM and JCN: EF. To demonstrate the benefit of our joint learning, we report the best Dice score of disjoint approaches, where rigid motion estimation is treated as a preprocessing step before geometric distortion estimation. Performance was evaluated through visual comparison and by computing translational and angular errors on simulated motions. We also manually added translations (in mm) and rotations (in degrees) to real fMRI scans with natural fetal motions, altering their maximum motion values in three directions. To better demonstrate our model's stability, we compared the Dice coefficient, computed from the overlapping regions between the aligned image and the target, across three levels of motion\footnote{Our code is released online, https://github.com/bchimagine/JointMotionTracking}.
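
For reference, the Dice coefficient follows the standard overlap definition. In the expression below, $A$ and $B$ are our shorthand for the binary brain regions of the aligned image and the target image; how these regions are obtained is an assumption outside the formula:
\begin{equation*}
\text{Dice}(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|},
\end{equation*}
where $|\cdot|$ counts the voxels in a region. A value of $1$ indicates perfect overlap and a value of $0$ indicates no overlap.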

\paragraph*{\bf Hyperparameters.} 
We set the dimension of the low-dimensional key points to $128$ when computing the closed-form update of the rigid transformation in Eq.~\eqref{eq:closeform}, and set $\alpha=3$ for the operator $L$, using $5$ time steps of Euler integration for geodesic shooting (Eq.~\eqref{eq:distance}). The noise variance was fixed at $\sigma = 0.02$. For network training, we used a batch size of $4$, a weight decay of $0.0001$ for $L_2$ regularization, and a learning rate of $\eta = 10^{-5}$ for $300$ epochs with the Adam optimizer. We divided the data into training (1283 pairs from 11 subjects), validation (299 pairs from 2 subjects), and testing (299 pairs from 2 subjects) sets. The best-performing networks were saved based on validation performance across all models. All experiments were conducted on an NVIDIA RTX A6000 GPU.
\section{Results}
\begin{figure}[!bt]
\begin{center}
 \includegraphics[width=.83\textwidth] {figures/examples_3views.pdf}
     \caption{Fetal motion correction visualizations for all methods. Left to right: source, target (contour highlighted in red), motion-corrected images by our methods (JCN: KM and JCN: EF), the conventional iterative method, DeepPose, and KeyMorph. \textbf{Dice scores}: \textbf{0.95 for JCN: KM}, \textbf{0.97 for JCN: EF}, 0.32 for conventional, 0.63 for DeepPose, and 0.73 for KeyMorph. Dice score of the best disjoint model versus ours: 0.90 vs. \textbf{0.97}. Average motion correction time: \textbf{0.491\,s} per pair.}
\label{fig:example}
\end{center}             
\end{figure}
Fig.~\ref{fig:example} presents a motion correction study for pairwise fMRI scans characterized by significant motions. It shows that the conventional and DeepPose methods fail to correctly align images, while our approach excels, closely matching the target image. Our method  effectively handles both rigid motion and local geometric deformations, leveraging estimated velocity fields for more accurate correction.

Fig.~\ref{fig:errorbar} illustrates the comparative analysis of motion correction errors in translation and rotation across different methods. It highlights the enhanced performance of our proposed technique in correcting fetal motions, surpassing the state-of-the-art methods.   
\begin{figure}[!bt]
\begin{center}
 \includegraphics[width=.80\textwidth] {figures/errorbar.pdf}
     \caption{Motion correction comparison of translation and rotation errors (averaged for all directions) of 299 real fMRI scans with simulated motions. }
\label{fig:errorbar}
\end{center}             
\end{figure}
\begin{figure}[!bt]
\begin{center}
 \includegraphics[width=.80\textwidth] {figures/dice.pdf}
     \caption{Motion correction performance evaluated by Dice score on real fMRI with different degrees of motion: small ($\mathcal{T}_{max} = 10 \text{mm}$, $\mathcal{R}_{max} = 5^{\circ}$), medium ($\mathcal{T}_{max} = 20 \text{mm}, \mathcal{R}_{max} = 10^{\circ}$), and large ($\mathcal{T}_{max} = 30 \text{mm}$, $\mathcal{R}_{max} = 20^{\circ}$). The Dice scores of our best model for the three motion levels, from left to right, are 0.97, 0.93, and 0.92.}
\label{fig:dice}
\end{center}             
\end{figure}

Fig.~\ref{fig:dice} shows the Dice coefficient comparisons for fMRI time sequences under various motion conditions. It demonstrates that our method consistently achieves higher Dice scores, regardless of the degree of motion. This highlights the robustness and stability of our model, particularly in challenging scenarios with significant motion.

\section{Conclusion}
This paper presents a predictive model for fetal motion correction using deep neural networks. In contrast to conventional approaches that independently estimate fetal brain motions and geometric distortions, our proposed method adopts an efficient and robust joint learning framework. This framework achieves consistent performance across various degrees of fetal motion. The model is effective on fMRI data with both simulated and real motions, showing significant potential for real-time tracking and steering systems for fetal head motion. An interesting avenue for future work is applying our model to image reconstruction for EPI data, which poses more challenging fetal motion artifacts.

% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{This research was supported in part by the National Institute of Biomedical Imaging and Bioengineering, the National Institute of Neurological Disorders and Stroke, and Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health (NIH) under award numbers R01NS106030, R01EB031849, R01EB032366, and R01HD109395; and in part by the Office of the Director of the NIH under award number S10OD025111. This research was also partly supported by NVIDIA Corporation and utilized NVIDIA RTX A6000 and RTX A5000 GPUs. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or NVIDIA.}

\bibliography{midl24_52}

\end{document}
