\section{Background}
\label{sec:background}
In unsupervised anomaly detection, reconstruction-based frameworks such as autoencoders (AEs) learn the distribution of healthy samples and then flag samples that deviate from this norm as anomalous. The encoder $E_{\theta}$ maps an input $x$ to a lower-dimensional latent space, and the decoder $D_{\phi}$ learns to reconstruct the input from this encoded representation. The parameters $\theta$ and $\phi$ of the AE are optimized on healthy input data $\chi = \{x_1, \dots, x_N\}$ by minimizing the mean squared error (MSE) between the inputs and their reconstructions:
\begin{equation}
\text{MSE} = \min_{\theta, \phi} \sum_{i=1}^{N} \| x_i - D_\phi(E_\theta(x_i)) \|^2 \enspace .
\label{eq:mse}
\end{equation}

It is then assumed that during inference, the AE will generate a so-called pseudo-healthy reconstruction, in which only in-distribution healthy tissue can be successfully reconstructed and thus any reconstruction errors can be thought of as anomalies. A subject-specific map of anomalies can then be obtained by taking the residual between an input $x$ and its reconstruction $x_{recon}=D_\phi(E_\theta(x))$ as follows:
\begin{equation}
m_{residual} = |x - x_{recon}| \enspace .
\label{eq:m_residual}
\end{equation}

Deformable autoencoders (deformable AEs)~\cite{bercea2023aes} were proposed to alleviate false positives in the anomaly maps caused by the limited reconstruction capabilities of traditional AEs. Since the top layers of the AE contain spatial information, deformable AEs use these layers to estimate a dense deformation field $\boldsymbol{\Phi}$ that allows local adaptations of the pseudo-healthy reconstruction to the individual anatomy of the subject. The estimation of the deformation field is optimized using local normalized cross-correlation (LNCC):
\begin{equation}
\mathcal{L}_{morph} = \text{LNCC}(x,x_{morph}) + \beta\|\boldsymbol{\Phi}\|^2 \enspace ,
\label{eq:loss_morph}
\end{equation}
where $\beta$ is a weight that is kept relatively high to constrain the deformations to be smooth and local, allowing only small changes to the reconstructions. We therefore refer to this part of the network as the constrained deformer. The improved reconstruction, which we refer to as the morphed reconstruction, $x_{morph}$, can then be obtained by $x_{morph} = x_{recon} \circ \boldsymbol{\Phi}$.
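
For reference, one widely used definition of the local normalized cross-correlation is given below. It is an assumption, since the exact window size and sign convention of~\cite{bercea2023aes} are not restated in this section. It compares $x$ and $x_{morph}$ inside a local window $\Omega_p$ centered at each voxel $p$:
\[
\text{CC}(x, x_{morph}) = \sum_{p} \frac{\Big(\sum_{q \in \Omega_p} \big(x(q) - \bar{x}(p)\big)\big(x_{morph}(q) - \bar{x}_{morph}(p)\big)\Big)^2}{\sum_{q \in \Omega_p} \big(x(q) - \bar{x}(p)\big)^2 \, \sum_{q \in \Omega_p} \big(x_{morph}(q) - \bar{x}_{morph}(p)\big)^2} \enspace ,
\]
where $\bar{x}(p)$ and $\bar{x}_{morph}(p)$ denote the mean intensities within $\Omega_p$. Under this definition a higher value indicates better local alignment, so implementations that minimize a loss usually use its negative. Whether $\text{LNCC}$ in Eq.~\ref{eq:loss_morph} denotes this similarity or a derived dissimilarity is not specified here. The symbols $\Omega_p$, $q$ and $\bar{x}$ are introduced for illustration only.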

The authors also propose using a perceptual loss (PL)~\cite{percep}, weighted by the hyperparameter $\alpha$, in addition to the MSE when optimizing the AE parameters, to promote reconstructions that closely resemble the training distribution:
\begin{equation}
\mathcal{L}_{recon} = \text{MSE}(x,x_{recon}) + \alpha \text{PL}(x,x_{recon}) \enspace.
\label{eq:recon_PL}
\end{equation}
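
A common form of the perceptual loss is sketched below, under the assumption that~\cite{percep} follows the usual feature-matching formulation. It compares the activations $f_l(\cdot)$ of selected layers $l$ of a fixed pre-trained network:
\[
\text{PL}(x, x_{recon}) = \sum_{l} \frac{1}{n_l} \, \| f_l(x) - f_l(x_{recon}) \|^2 \enspace ,
\]
where $n_l$ is the number of elements in the feature map of layer $l$. The symbols $f_l$ and $n_l$, the choice of layers, and the normalization are illustrative and may differ from the cited work.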

\section{Methods and Materials}
\label{sec:methods}

\begin{figure*}[t!]
\centering
\includegraphics[width=0.99\textwidth]{figures/fig_arch.png}
\caption{Our approach, \textit{MORPHADE}, integrates a dual-deformation strategy with a 3D autoencoder and adversarial training. The constrained deformer refines the reconstruction to generate a residual map with reduced false positives, while the unconstrained deformer is used to produce a folding map that highlights anomalies. The residual and folding maps together produce an anomaly map that allows the localization and assessment of the severity of atrophy.}
\label{fig1}
\end{figure*}

We propose \textit{MORPHADE}, shown in Fig.~\ref{fig1}, which builds upon deformable AEs. First, we employ a 3D convolutional AE so that the framework can operate on 3D images. Second, since PL relies on 2D networks pre-trained on ImageNet and is therefore not directly applicable to 3D volumes, we instead employ an adversarial loss~\cite{adversarial} to increase the realism of the reconstructions. We train a discriminator by minimizing this adversarial loss, and the reconstruction loss becomes:
\begin{equation}
\mathcal{L}_{recon} = \text{MSE}(x,x_{recon}) + \gamma \text{Adversarial}(x,x_{recon}) \enspace,
\label{eq:recon_adversarial}
\end{equation}
where $\gamma$ balances the realism of the reconstructions against their pixel-wise accuracy.
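
One standard instantiation is given below as an assumption, because the exact objective of~\cite{adversarial} is not restated here. It uses a discriminator $C$ that outputs the probability that its input is a real image. The discriminator and the AE then minimize, respectively,
\[
\mathcal{L}_{C} = -\log C(x) - \log\big(1 - C(x_{recon})\big)
\qquad \text{and} \qquad
\text{Adversarial}(x, x_{recon}) = -\log C(x_{recon}) \enspace .
\]
The second expression is the non-saturating generator loss: it is small when the discriminator judges $x_{recon}$ to be real. The symbol $C$ (chosen to avoid a clash with the decoder $D_\phi$) and the non-saturating form are illustrative choices.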

Our major extension to deformable AEs is a dual-deformation strategy, in which we employ an unconstrained deformer in addition to the constrained deformer, with the aim of improving the localization of atrophic regions. As previously stated, the constrained deformer is trained with a high value of $\beta$ to improve the generation of the pseudo-healthy reconstructions and thus reduce false positives in the anomaly maps. In contrast, the unconstrained deformer has the goal of reverting the pseudo-healthy reconstruction back to its original anomalous state. This deformer is trained with the same loss as in Eq.~\ref{eq:loss_morph}, but with a low value of $\beta$, which allows the creation of unconstrained deformation fields. In such deformation fields, deformations should remain small in areas of healthy tissue. Conversely, in regions of atrophy, the deformation field exhibits foldings, i.e., areas in which the mapping from the pseudo-healthy reconstruction to the original image is not one-to-one due to the loss of tissue volume. The determinant of the Jacobian $J_{\boldsymbol{\Phi}}$ of the deformation field, $\det(J_{\boldsymbol{\Phi}})$, quantifies local volume changes, with negative values indicating such foldings. Therefore, we highlight the anomalies by using the negative Jacobian determinant values to generate a map of the foldings, $m_{foldings} = \sigma(\max(0, -\det(J_{\boldsymbol{\Phi}})))$, where $\sigma$ is the Gaussian filtering operation.
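
To make the folding criterion concrete, assume that the deformation is written as $\boldsymbol{\Phi}(p) = p + u(p)$ with a displacement field $u = (u_1, u_2, u_3)$. This parameterization is not stated above. Then the Jacobian at voxel $p = (p_1, p_2, p_3)$ is
\[
J_{\boldsymbol{\Phi}}(p) = I + \nabla u(p) =
\left( \begin{array}{ccc}
1 + \frac{\partial u_1}{\partial p_1} & \frac{\partial u_1}{\partial p_2} & \frac{\partial u_1}{\partial p_3} \\[4pt]
\frac{\partial u_2}{\partial p_1} & 1 + \frac{\partial u_2}{\partial p_2} & \frac{\partial u_2}{\partial p_3} \\[4pt]
\frac{\partial u_3}{\partial p_1} & \frac{\partial u_3}{\partial p_2} & 1 + \frac{\partial u_3}{\partial p_3}
\end{array} \right) \enspace .
\]
A determinant of $1$ means no local volume change, values in $(0, 1)$ indicate local shrinkage, values above $1$ indicate expansion, and negative values indicate that the mapping locally reverses orientation, i.e., folds. As a one-dimensional illustration, $u(p) = -0.5\,p$ gives $\boldsymbol{\Phi}'(p) = 0.5$, a compression without folding that contributes $\max(0, -0.5) = 0$ to the folding map, whereas $u(p) = -1.5\,p$ gives $\boldsymbol{\Phi}'(p) = -0.5$ and contributes $\max(0, 0.5) = 0.5$ before Gaussian filtering.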

We finally multiply these foldings pixel-wise with the residual map from the constrained deformer to generate an anomaly map with reduced false positives and improved atrophy localization:
\begin{equation}
    \text{Anomaly Map} = m_{residual} \times m_{foldings} \enspace.
\end{equation}
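
Written out per voxel $p$, the anomaly map $A$ reads
\[
A(p) = \big| x(p) - (x_{recon} \circ \boldsymbol{\Phi}_{c})(p) \big| \cdot \sigma\big(\max(0, -\det(J_{\boldsymbol{\Phi}_{u}}))\big)(p) \enspace .
\]
This expression is a sketch. The subscripts $c$ and $u$, which distinguish the constrained and unconstrained deformation fields, are hypothetical and not part of the notation above. It also assumes that the residual is taken against the morphed reconstruction of the constrained deformer. Because the two factors are multiplied, a voxel is highlighted only where both the residual and the smoothed folding response are nonzero: residual errors outside folded regions and foldings without residual error are both suppressed.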


\noindent\textbf{Implementation.} All networks are trained with the Adam optimizer. The discriminator is trained with a learning rate of $1 \times 10^{-4}$; all other networks use $5 \times 10^{-4}$.

We carry out training in two phases to obtain first the constrained deformer and subsequently the unconstrained deformer. First, the entire framework is trained for 200 epochs with a high value of $\beta=10$; in this way, we train the constrained deformer while also encouraging the AE to produce sharper reconstructions. We motivate this in Fig.~\ref{fig:beta}a, which shows that decreasing values of $\beta$ during training result in blurrier reconstructions. Conversely, a high $\beta$ value ensures that the AE does not overly rely on the deformations to achieve faithful reconstructions, but is instead forced to learn an accurate representation of the in-distribution data. In the second phase, we train the unconstrained deformer by keeping the weights of the AE frozen while optimizing the deformation parameters with a lower value of $\beta=0.01$ for 100 epochs. We demonstrate the need for lower $\beta$ values to produce improved folding maps in Fig.~\ref{fig:beta}b, where it can be seen that low values accentuate the anomalous regions in the brain. Finally, at inference, we use the constrained deformer to obtain the residual maps and the unconstrained deformer to generate the folding maps.
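
The two phases can be summarized as the following pair of optimization problems. This is a sketch under stated assumptions: the deformer parameters $\psi_{c}$ and $\psi_{u}$, the superscript indicating the value of $\beta$, and the unweighted sum of the two losses in the first phase are not specified above, and the discriminator update is omitted.
\[
\text{Phase 1 (200 epochs):} \quad \min_{\theta, \phi, \psi_{c}} \; \mathcal{L}_{recon} + \mathcal{L}_{morph}^{\beta = 10} \enspace ,
\]
\[
\text{Phase 2 (100 epochs):} \quad \min_{\psi_{u}} \; \mathcal{L}_{morph}^{\beta = 0.01} \quad \text{with } \theta, \phi \text{ fixed} \enspace .
\]
Whether $\psi_{u}$ is initialized from $\psi_{c}$ or trained from scratch is likewise left open here.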





\begin{figure*}[t!]
\centering
\includegraphics[width=0.6\textwidth]{figures/morphade_effectOfBeta_v2.png}
\caption{a) During the first phase of training, a high value of $\beta=10$ constrains the deformer, encouraging the AE to produce less blurry reconstructions. b) During the second phase of training, a lower value of $\beta=0.01$ leaves the deformer unconstrained. This unconstrained deformer is then used to generate folding maps (here shown overlaid on the input brain) that enhance the identification of anomalies.}
\label{fig:beta}
\end{figure*}


\noindent \textbf{Dataset and Preprocessing.} Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (\url{adni.loni.usc.edu})~\cite{Petersen201}. We use skull-stripped T1-weighted MPRAGE images of both male and female patients, registered to the MNI brain template~\cite{mni}. Our training set comprises 760 healthy control (HC) samples, with an additional 95 HC samples used for validation. To train the supervised baseline, an additional 430 Alzheimer’s disease (AD) samples are used. The test set includes 215 HC samples and 200 AD samples.