\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{eqnarray}
\usepackage{ulem}
\usepackage{framed,multirow, multicol}
\usepackage{bm}
\usepackage{verbatim}

\setlength{\textfloatsep}{2.2pt plus 2.0pt minus 2.0pt}

\usepackage{titlesec}

\titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{7pt plus 4pt minus 2pt}

%footnotes on one line
\usepackage[para]{footmisc}

% for several reference to the same footnote
\makeatletter
\newcommand\footnoteref[1]{\protected@xdef\@thefnmark{\ref{#1}}\@footnotemark}
\makeatother


% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images

\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\jmlrvolume{-- nnn}
\editors{Accepted for publication at MIDL 2024}

\title[An unexpected confounder: how brain shape can be used to classify MRI scans?]{An unexpected confounder: how brain shape can be used to classify MRI scans?}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship


\midlauthor{\Name{Valentine {Wargnier-Dauchelle}} \Email{valentine.wargnier@creatis.insa-lyon.fr}\\
\Name{Thomas Grenier} \Email{thomas.grenier@creatis.insa-lyon.fr}\\
\Name{Michaël Sdika} \Email{michael.sdika@creatis.insa-lyon.fr}\\
\addr INSA Lyon, Université Claude Bernard Lyon 1, CNRS, Inserm, CREATIS UMR 5220, U1294, Lyon, France}


\begin{document}

\maketitle

\begin{abstract}
Although deep learning has proved its effectiveness in the analysis of medical images, its great ability to extract complex features makes it prone to basing its decisions on spurious confounders present in the images. However, especially for medical applications, network decisions must be based on relevant elements. Numerous confounding factors have been identified in the case of brain scans, such as gender, age, and MRI site or scanner. Nevertheless, although skull stripping is a classic preprocessing step for brain scans, brain shape has never been considered as a possible confounder. In this work, we show that brain shape is used in the classification of brain MRI scans from different databases, even when it should not be considered a clinically relevant factor. To this end, we introduce a rigorous two-step method to assess whether a factor is a confounder, and we apply it to identify brain shape as a confounding variable in brain image classification. Lastly, we propose to use deformable registration in the data preprocessing pipeline to align the brain contours of the images in the datasets, whereas standard pipelines often do nothing more than affine registration. Including this deformable registration step makes the classification free from the brain shape confounding effect.
\end{abstract}

\begin{keywords}
Confounding factor, Classification, Brain shape, Deformable registration, Interpretability
\end{keywords}



\section{Introduction}

Deep learning has emerged as a powerful tool in the field of medical imaging. 
Its ability to automatically learn and extract complex patterns from vast amounts of data has revolutionized the way we analyze images.
However, the strong performance of deep learning comes at the price of the black-box nature of these methods: deep neural networks, with their non-linearity and their large number of parameters, are difficult to explain.
Training explainable and interpretable networks is therefore a key issue for medical image analysis, as the lack of transparency can hide the fact that the network decision may be based on the wrong reasons:
a bias in the training set can cause a confounding factor to play an important role in the decision.
Classifiers are especially subject to this problem, whether they are used for pure classification problems or as guidance for adversarial networks or diffusion models.

Several tricks can be used to reduce or remove the influence of a known confounding factor from a model.
The simplest is to carefully collect the training dataset such that the confounding variables are matched in the different classes as in \cite{leming2022construction}. 
However, this approach makes it tedious to create large datasets.
If possible, normalization preprocessing can also be used to discard variations of this variable across the dataset. For example, in \cite{wargnier2021more}, the MRI signature is removed using brain tissue probability maps instead of MRI scans as input of a deep classifier. 
Data augmentation can also be used to make the model invariant to a set of variables or transforms.
In the literature, works have also been proposed to train models free from the influence of a known confounder.
In \cite{zhao2020training}, the features of a network are trained for the prediction of the objective task but also
trained adversarially for the prediction of the gender, considered here as the confounding variable, making the features invariant to this confounder. 
In \cite{wang2018removing}, the model is first trained for the objective task and the top layer is then fine-tuned 
to predict the confounding variable: the gender, the subject or the contrast material. 
During this fine-tuning step, weights sensitive to the confounding factor are identified and discarded. 
For these methods to be used, the potential confounders need to be identified first.
Moreover, for the two latter methods, the potential confounding factor should be available as scalar or categorical data, and the value of the confounding variable should be known during training.
To identify confounding factors in images, attribution maps can be used to localize the information used by the network to make its decision. For example, in \cite{sun2023right}, a protocol for comparing attribution maps was
proposed by visualizing confounding factors artificially added to the images.
In \cite{wargnier2021more,wargnier2023weakly}, these maps are used to validate and/or improve the interpretability of the network by verifying that the decision is based on brain lesions.

In this work, we assess the importance of the brain shape as a confounding factor in the classification of brain images. 
To do so, we propose a rigorous protocol to verify that the brain shape is indeed a confounding factor used as part of the network decision despite the standard affine spatial normalization.
The first step of our protocol is to verify whether it is possible to classify various datasets using the confounding factor only: for the concrete case we investigate, we propose to use the brain mask as a representation of the brain shape, and our first counterintuitive result is that it is possible to classify the datasets using only this mask.
Then, we check that the brain shape is indeed part of the decision when the original grayscale images are used as input to the network.
To this end, we modify the image such that the identified factor could come from the opposite class and evaluate the impact on the classification task.
We propose two ways to change the brain shape from one class to another: trimming the borders of the brain, or deformable registration.
Finally, we show that by complementing the standard affine registration of the preprocessing with a deformable registration to normalize the brain shape,
 we can classify brain images while canceling out the confounding effect of brain shape in classification.

\section{Method}

\subsection{A generic two-step confounding factor identification procedure}

Our procedure to assess whether a variable is indeed a confounding factor has two steps.
In the first step, we verify that the suspected variable is a potential confounding factor
by checking whether or not it is possible to classify the data using this variable only.
To do so, a classifier, having this variable only as input, is trained to classify the subjects using the same class label as the original problem.
If the classifier performs at chance level, with an accuracy close to 0.5, the variable can be discarded from the potential
confounders of the problem.

If the classification is possible in the first step, we check in the second step 
whether this factor is indeed used by a model trained with the original images as input.
To do so, the model is trained conventionally on the original images. 
Then, test data are transformed such that the value of the suspected factor lies within the distribution of this factor for the opposite class, while modifying the image as little as possible.
The difference in classification performance between the original and the transformed test data is then measured.
A lower classification performance for transformed data would indicate that the identified factor is used by the model to correctly classify and is indeed a confounder.
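
As a hedged formalization of these two steps (the notation in this paragraph is introduced for illustration only and is an assumption on our part), let $c$ denote the suspected factor, $g$ a classifier trained on $c$ only, $f$ a classifier trained on the original images, $\tau$ the test-time transform that moves $c$ toward the opposite class, $X_{test}$ the test set and $\mathrm{Perf}$ the chosen performance measure. The two checks can then be sketched as
\begin{equation*}
\text{Step 1: } \mathrm{Perf}(g; c(X_{test})) \text{ clearly above } 0.5,
\qquad
\text{Step 2: } \Delta = \mathrm{Perf}(f; X_{test}) - \mathrm{Perf}(f; \tau(X_{test})) > 0 .
\end{equation*}
Under this reading, the larger the drop $\Delta$, the more the model trained on the original images appears to rely on the factor $c$.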

\subsection{Identifying the brain shape as a confounding factor}
\label{sec:identconfounder}

For the concrete case of the brain shape, the first step is achieved by trying to classify 
several brain datasets using the brain masks only as input of the network. This binary mask, indicating
whether a voxel is inside the brain (value of 1) or outside (value of 0), is used as a representation of the brain shape.
For the second step, we need a transform to make the brain shape of a subject
match the shape of subjects in the opposite class, while 
changing the image content as little as possible. Two such transforms are investigated: 
brain mask crop and brain mask registration.

\paragraph{Brain mask crop} A brain mask is randomly drawn from the opposite class and used to crop the grayscale image of the current subject: voxels outside the mask are set to the background value. 
This cropped image is given as input to the classification model at test time. 
The image remains the same inside the brain mask but it is changed at its border. 
Note that this technique modifies only a part of the shape: the part outside the mask drawn from the opposite class.
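
In symbols, and as an illustrative sketch only (the notation $I$, $B$, $B_o$ and $b$ is an assumption introduced here), let $I$ be the grayscale image of the current subject, $B$ its own brain mask, $B_o$ the binary mask drawn from the opposite class and $b$ the background value. The cropped image can then be written as
\begin{equation*}
I_{\mathrm{crop}}(x) = B_o(x)\, I(x) + \bigl(1 - B_o(x)\bigr)\, b .
\end{equation*}
If the image is already skull stripped, so that $I(x) = b$ wherever $B(x) = 0$, only the voxels with $B(x) = 1$ and $B_o(x) = 0$ are modified, which corresponds to the partial shape modification noted above.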

\paragraph{Brain mask registration}
\label{sec:reg}
A brain mask is also randomly drawn from the opposite class but this time, it is used as a reference to realign the brain shape of the current subject.
To realign a moving brain mask $B_m$ on a reference brain mask $B_r$, we solve the following optimization problem:
\begin{equation}
 \label{eq:regshape}   
\min_{T \text{ s.t. } J(x)\ge t} \sum_{x \in \partial B_r } d(T(x),\partial B_m) + \lambda \sum_x \| \Delta T (x) \|^2,
\end{equation}
 where $T$ is the transformation we look for, $\partial B_m$ and $\partial B_r$ are the borders of the two brain masks and $d(\cdot,\partial B_m)$ is the Euclidean distance to the border of the moving brain mask.
 To penalize strong deformations, a bending energy term with a coefficient $\lambda$ is added and 
 the Jacobian determinant $J(x)$ of the transformation is constrained to be higher than a given threshold $t$ for all voxels. 
 Any registration algorithm could be used with a null image as the fixed image, 
 the distance transform of $\partial B_m$ as the moving image and a cost function masked 
with $\partial B_r$. 
The transformation $T$ is then applied to the original grayscale image to obtain an image with a brain shape of the opposite class.
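
One possible way to spell out this remark, under notation introduced here for illustration (a sketch, not necessarily the exact implementation), is to let $D_m(y) = d(y,\partial B_m)$ denote the distance transform of the moving border. Since $D_m \ge 0$, the data term of Equation \ref{eq:regshape} can be rewritten as
\begin{equation*}
\sum_{x \in \partial B_r} d(T(x),\partial B_m) = \sum_{x \in \partial B_r} \bigl| D_m(T(x)) - 0 \bigr| ,
\end{equation*}
that is, an absolute-difference cost between the warped moving image $D_m \circ T$ and a null fixed image, restricted to the voxels of $\partial B_r$. With this convention, $T$ maps reference coordinates to moving coordinates, so the transformed grayscale image would be obtained by resampling, $I'(x) = I(T(x))$.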

\subsection{Eliminating the brain shape confounding effect with normalization}
\label{sec:regnorm}
Affine registration to a reference template is usually included in the data preprocessing
to spatially normalize the datasets.
We argue that this affine registration step is not sufficient to avoid the brain shape confounding effect in deep learning.
We propose to add a deformable registration step to normalize the brain shape.
To do so, we extract the brain mask of all subjects as well as of the reference template, and realign each subject's brain mask to the reference template brain mask by solving the registration problem of Equation \ref{eq:regshape}.
The computed deformable transforms are then applied to the corresponding images to create shape-normalized brain datasets that can be used to train any network.

\section{Experiments}
\label{sec:exp}


\subsection{Data}
\label{ssec:data}

Seven T1w MRI datasets are used in our experiments: 
the five public healthy databases IXI\footnote{\href{https://brain-development.org/ixi-dataset}{brain-development.org/ixi-dataset}}, 
HCP\footnote{\href{https://www.humanconnectome.org/study/hcp-young-adult}{humanconnectome.org/study/hcp-young-adult}} 
\cite{babayan2019mind}, kirby \cite{landman2011multi}, MPI \cite{babayan2019mind} 
and IBC \cite{pinho2018individual},
the OFSEP/EDMUS multiple sclerosis (MS) dataset\footnote{\href{https://www.ofsep.org}{ofsep.org}} from the ``Observatoire français de la sclérose en plaques'', the French MS registry~\cite{vukusic2020ofsep,confavreux1992edmus}, 
the MICCAI BraTS 2020 public glial tumor dataset \cite{bakas2017advancing, bakas2018identifying, menze2014multimodal}, which also includes manual tumor segmentations, and the Alzheimer's disease (AD) ADNI-1 dataset\footnote{\href{https://adni.loni.usc.edu}{adni.loni.usc.edu}} \cite{weiner2010adni}, which also includes healthy subjects (CN).
The split into training, validation and test sets is given in Table~\ref{t:data}.

 \begin{table}[t]
     \centering
     \small
     \caption{T1 MRI datasets. H refers to healthy, MS to multiple sclerosis, T to tumors and AD to Alzheimer's disease.}
     \label{t:data}
     \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
     \hline
     Dataset     &     IXI     &          HCP &       MPI   &    kirby   &    IBC     &     OFSEP   &   BraTS    & \multicolumn{2}{c|}{ADNI}       \\ 
                 &             &              &             &            &            &             &            &    CN    & AD       \\ \hline
     $N_{train}$ &     400     &          500 &          64 &         22 &          8 &         383 &       280  & 183      & 150      \\ \hline
     $N_{val}$   &     130     &          100 &          15 &          \phantom{0}5 &          2 &          \phantom{0}97 &        \phantom{0}40  & \phantom{0}23       & \phantom{0}19       \\ \hline
     $N_{test}$  &      \phantom{0}50      &          500 &          15 &          \phantom{0}5 &          2 &          \phantom{0}30 &        \phantom{0}49  & \phantom{0}23       & \phantom{0}19       \\ \hline
     Status      &     H       &            H &          H  &          H &          H &          MS &         T  & H        & AD       \\ \hline
      Age        & $50 \pm 17$ &   $29 \pm 4$ & $31 \pm 14$ & $31 \pm 7$ & $34 \pm 5$ & $43 \pm 12$ & $60 \pm 9$ & $76\pm5$ & $75\pm8$ \\\hline
    \multicolumn{10}{c}{}
\end{tabular}
 \end{table}


\subsection{Experimental protocol}
\label{ssec:proto}

MR images are preprocessed using FSL FLIRT affine registration to the T1 MNI atlas \cite{jenkinson2001global, jenkinson2002improved}, HD-BET brain extraction \cite{isensee2019automated} and N4 bias field correction \cite{tustison2010n4itk}, except for the BraTS dataset: as this dataset is provided already preprocessed with the CaPTk pipeline\footnote{\href{https://cbica.github.io/CaPTk/preprocessing_brats.html}{cbica.github.io/CaPTk/preprocessing\_brats.html}}, which includes a brain extraction step (different from HD-BET), we only applied the affine registration and the bias field correction.
The final image size is $91\times109\times91$ with a 2\,mm voxel size.
Binary classifiers are trained to classify the brain MRI datasets (either healthy, with multiple sclerosis or with tumor subjects).
In the following, ``shape normalized'' datasets refer to the datasets normalized using the procedure of Section \ref{sec:regnorm}.
We also evaluated the impact of elastic deformation data augmentation during training (denoted as ``Elastic DA''). 
The deformations were chosen to be strong enough to hide the differences between brain shapes of the different datasets.
Classification performances are evaluated using 
the true positive/negative rate (TPR/TNR) and the balanced accuracy (BA).
In Section \ref{sec:bmaskclassify}, we analyze the feasibility of distinguishing the datasets
using only the confounding factor, that is to say, using only the binary brain masks. 
In Section \ref{sec:bshapeisused}, we evaluate whether the brain shape is indeed a confounding factor for classification models trained with MRI input\footnote{\label{noteHH}Results for healthy vs healthy datasets are given in the supplementary material for space reasons.}. 
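
For reference, we recall the usual definitions of these metrics for a binary task; we assume here that these standard definitions are the ones used, and the symbols $TP$, $TN$, $FP$ and $FN$ (numbers of true positives, true negatives, false positives and false negatives on the test set) are introduced only for this reminder:
\begin{equation*}
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad
\mathrm{BA} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} .
\end{equation*}
With these definitions, a classifier that guesses at random has an expected BA of $0.5$ whatever the class proportions in the test set, which makes BA a convenient chance-level reference when the classes are of unequal size.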

\subsection{Implementation details}
\label{ssec:claissifier}

The classifier, implemented in PyTorch, is a 3D PatchGAN~\cite{isola2017image}, trained with the Adadelta optimizer~\cite{zeiler2012adadelta}, class-balanced minibatches, an initial learning rate of 1 and the cross-entropy loss.
This CNN is defined as $C64$-$C128$-$C256$-$C512$ where $Ck$ denotes a Convolution-BatchNorm-LeakyReLU (slope $0.2$) layer with $k$ filters, except for the first layer on which no BatchNorm is applied. 
At the end, a convolution is applied to obtain a 1-dimensional output.
In Section \ref{sec:res:attrib}, attributions were computed using gradient maps \cite{simonyan2013deep}.
The brain shape registration algorithm of Section \ref{sec:reg}
is based on the algorithm described in \cite{sdika2008tmi}. We set $t=0.85$, $\lambda=10^{-3}$ and the transformation is represented by a B-Spline vector field with a node spacing of 4 voxels.  



\section{Results}

\subsection{Attributions highlight brain borders}
\label{sec:res:attrib}

\begin{figure}[t]
    \centering
    \subfigure[IXI         \label{f:att_i}   ]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/ixiB_att.png} }      
      \subfigure[BraTS     \label{f:att_b}   ]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/brats_att.png} }     
    \subfigure[IXI   norm. \label{f:att_ireg}]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/ixiregB_att.png} }
    \subfigure[BraTS norm. \label{f:att_breg}]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/bratsreg_att.png} }     
    \caption{IXI vs BraTS classification gradient attributions. From left to right: for an IXI image, for a BraTS image, for a shape normalized IXI image
    and BraTS image.
    The tumor is in yellow,  negative attributions in blue and positive ones in red.}
        \label{f:att}
\end{figure}

To visualize the confounding factors, we use attribution maps which indicate the relevance of each voxel in the network decision.
Figure \ref{f:att} shows some attribution examples for the IXI vs BraTS classification.
In Figures \ref{f:att_i} and \ref{f:att_b}, attributions are concentrated at the top and bottom of the brain, near the borders. In particular, high attributions (in absolute value) are found in the brainstem. Yet, one would expect the areas inside the brain to be the most useful for decision-making. 
Although these attributions indicate that the brain shape can be involved in the network decision,
they are not sufficient alone to draw a definitive conclusion.

\subsection{Brain masks can be used to classify datasets}
\label{sec:bmaskclassify}


We apply the first step of our method as described in Section \ref{sec:identconfounder}: 
binary classifiers are trained to classify several datasets using the brain masks only as input. 
As shown in Table \ref{t:acc_mask}, the first startling result is that, except for the intra-dataset ADNI task, it is possible to distinguish all pairs of datasets based on the brain shape only, without using any tissue or texture information. 
Brain masks from the tumors dataset BraTS can be classified from the healthy dataset IXI with a perfect classification score. Note that this classification performance could be partly explained by the difference between the BraTS preprocessing pipeline and the pipeline used for the other datasets.
Classifying MS brain masks from the healthy ones is more difficult but it is still possible with a balanced accuracy of 62\%. 
Thus, it is possible to classify healthy vs pathological subjects using only the brain shape. 
One might wonder to what extent the difference in brain shape is due to the pathology itself or to some differences due to the way datasets are built.
To investigate the dataset construction effect, we consider the brain mask classification task between several healthy datasets and within the ADNI dataset.
For example, we obtain a perfect accuracy for the IXI vs HCP problem. Even when several databases are aggregated in one class, the distinction remains possible, with TNR and TPR both above 85\% for IXI vs IBC/kirby/MPI. 
Population age in the databases is another element that can influence brain shapes (through normal age-related brain atrophy). 
However, although the subjects of HCP and IBC/kirby/MPI are matched in age, it is possible to classify these datasets almost perfectly. 
In the intra-ADNI experiment, classes are from the same dataset and match in age but one is healthy and the other pathological. One can see that, despite the possible atrophy due to AD, brain masks are more difficult to distinguish, with a balanced accuracy as low as 55\%. This reinforces the idea that, in general, the way datasets are built is a stronger factor than the disease itself for the brain mask classification.
Thus, it is possible to distinguish various databases based on the brain shape only, and the difference seems to be linked to a dataset construction effect that persists even when disease and age differences are absent.
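
As a quick arithmetic check, assuming the usual definition of the balanced accuracy as the mean of TNR and TPR (an assumption, as the definition is not restated here), the values reported in Table \ref{t:acc_mask} are mutually consistent:
\begin{equation*}
\frac{0.54 + 0.70}{2} = 0.62, \qquad
\frac{1.00 + 0.86}{2} = 0.93, \qquad
\frac{0.99 + 0.95}{2} = 0.97, \qquad
\frac{0.35 + 0.74}{2} = 0.545 \approx 0.55,
\end{equation*}
for IXI vs OFSEP, IXI vs IBC/kirby/MPI, HCP vs IBC/kirby/MPI and ADNI-CN vs ADNI-AD respectively, the last value agreeing with the table up to the rounding of the reported rates.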

\begin{table}[t]
    \small
    \centering
    \caption{Classification accuracy on brain masks. The left dataset is the negative class.}
    \begin{tabular}{|ccc|c|c|c|}
    \hline
         \multicolumn{3}{|c|}{Classification task}  & TNR  & TPR  & BA          \\ \hline
         IXI &vs& BraTS                             & 1.00 & 1.00 & 1.00\\ \hline
         IXI &vs& OFSEP                             & 0.54 & 0.70 & 0.62\\ \hline
         IXI &vs& HCP                               & 1.00 & 1.00 & 1.00\\ \hline
         IXI &vs& IBC/kirby/MPI                     & 1.00 & 0.86 & 0.93\\ \hline
         HCP &vs& IBC/kirby/MPI                     & 0.99 & 0.95 & 0.97 \\ \hline\hline
ADNI-CN &vs& ADNI-AD              & 0.35 & 0.74 & 0.55 \\ \hline
    \multicolumn{6}{c}{}
    \end{tabular}
    \label{t:acc_mask}
\end{table}


\subsection{Brain shape is part of the decision}
\label{sec:bshapeisused}


\begin{figure}[h]
    \centering
    \subfigure[Original         \label{f:b_orig}   ]{ \includegraphics[width=0.23\textwidth, trim=155mm 5mm 155mm 30mm, clip=true]{Figures/brats_orig.png} }      
      \subfigure[Cropped     \label{f:b_crop}   ]{ \includegraphics[width=0.23\textwidth,trim=155mm 5mm 155mm 30mm, clip=true]{Figures/brats_crop.png} }     
    \subfigure[Registered \label{f:b_reg}]{ \includegraphics[width=0.23\textwidth, trim=155mm 5mm 155mm 30mm, clip=true]{Figures/brats_reg.png} }
\\
    \subfigure[Original  norm.       \label{f:breg_orig}   ]{ \includegraphics[width=0.23\textwidth, trim=155mm 5mm 155mm 30mm, clip=true]{Figures/bratsreg_orig.png} }      
      \subfigure[Cropped norm.    \label{f:breg_crop}   ]{ \includegraphics[width=0.23\textwidth,trim=155mm 5mm 155mm 30mm, clip=true]{Figures/bratsreg_crop.png} }     
    \subfigure[Registered norm. \label{f:breg_reg}]{ \includegraphics[width=0.23\textwidth, trim=155mm 5mm 155mm 30mm, clip=true]{Figures/bratsreg_reg.png} }
    \caption{Example of a BraTS image (with tumor in yellow) with a random IXI brain mask (in red) and the corresponding modified images (cropped or registered). The first (resp. second) row is without (resp. with) shape normalization. In this case, the original image without shape normalization is classified as pathological, while the modified images are classified as healthy.}
        \label{f:im_brats}
\end{figure}

In this part, we apply the second step of the process described in Section \ref{sec:identconfounder}:
for each subject, a brain mask is randomly drawn from the opposite class 
and used to crop the current image (``Cropped images'') or to realign its brain shape with a deformable registration (``Registered images'').
An image for which the crop or the registration changes the predicted class, as well as the corresponding modified images, are shown in Figure \ref{f:im_brats}. We can see that the brain shapes of the two classes are different. With the crop, the shape is only partially modified: for example, the areas around the brainstem and the frontal lobe do not change, whereas the registration fits the target shape more closely. Most of the tumor area is left untouched by the crop, so this loss of information is unlikely to be what changes the classification. Indeed, on average in the test set (without shape normalization), only $3.6 \% \pm 5.0\%$ of the tumor is cropped ($3.3 \% \pm 5.5\%$ for the images still classified as pathological and $5.6 \% \pm 3.3\%$ for the others). The classification results are presented 
in Figure \ref{f:class_im}.
For tumors, without elastic data augmentation, the classification performance on the original images is perfect. In contrast, when the shape is modified, the accuracy of both classes falls. The TNR is lower than 50\% for the cropped images and it falls to 16\% for the registered images. Thus, the brain shape seems to be a key factor learned by the network to classify the images.
For MS, the impact is not as strong but still present, with a mean loss of accuracy of 3 points for cropped images and 16 points for registered images. This is consistent with the fact that brain masks are harder to distinguish for the MS dataset, as shown in Section \ref{sec:bmaskclassify}.
When elastic data augmentation is used, the classification is more difficult on the original images for the tumors dataset, but the decision seems less based on the brain shape, as the accuracy decreases less when the modified images are tested: the accuracy is on average 27 points lower for cropped images and 7 points lower for registered images. 
This data augmentation improves the robustness, especially for MS, as the classification is slightly better on the original images. In this case, the brain deformation seems to have hardly any impact, with only a 3-point accuracy difference on the registered images.  

\begin{figure}[t]
    \centering
    \subfigure[IXI vs BraTS\label{f:brats_acc}]{\includegraphics[width=0.49\textwidth,  trim=0mm 0mm 0mm 0mm, clip=true]{Figures/brats_acc_lab.png}}
    \subfigure[IXI vs OFSEP    \label{f:ofs_acc}]{\includegraphics[width=0.49\textwidth, trim=0mm 0mm 0mm 0mm, clip=true]{Figures/ofs_acc_lab.png}}
    \caption{True positive/negative rate (TPR/TNR) for IXI vs BraTS (left) or OFSEP (right) classification.
    Bar plots are grouped depending on whether brain shape normalization and elastic data augmentation are used or not. Colors refer to whether original images (light blue), cropped images (middle blue) or registered images  (dark blue) are used at test time.}
    \label{f:class_im}
\end{figure}




\subsection{Deformable registration removes the confounding factor}

The previous experiments validate that the brain shape is a part of the decision for MRI classification.
We argue that if the images are realigned not only with an affine registration 
but also such that the brain shapes are realigned to the reference template, the network decision can be free from the brain shape confounder. 
Figures \ref{f:im_brats} and  \ref{f:class_im} present the results of the same experiments as in Section \ref{sec:bshapeisused} but using images with the shape normalization of Section \ref{sec:regnorm}. 
Visually, with the shape normalization, there is virtually no remaining difference between the shapes.
In terms of classification, we obtain similar or better performance than without normalization for both tumors and MS.
When the modified images are used at inference, the impact is minor, with an accuracy loss of around 5 points for cropped tumor images and around 1 point for registered tumor images. This loss is smaller than for the model trained on images without shape normalization, even with elastic data augmentation.
For MS, the results are equivalent or slightly better than without shape normalization and with elastic data augmentation. 
With the shape normalization, elastic data augmentation seems less useful. 
Moreover, in Figures \ref{f:att_ireg} and \ref{f:att_breg}, attributions are no longer localized on the borders but all over the brain. 
Therefore, the shape normalization seems to be enough to eliminate the confounding effect of the brain shape.

\section{Conclusion}

In this work, we propose a generic method to assess whether a variable is a confounding factor or not.
We apply the proposed protocol to several public MRI datasets to identify the brain shape as a counterintuitive confounder for brain scan classification.
In addition, we propose to add a non-rigid brain shape realignment to the preprocessing pipeline to eliminate the confounding effect of the brain shape. As this step does not degrade the classification performance, our recommendation is to use it systematically (even when the brain shape is not a confounder) in addition to the affine registration conventionally used in standard pipelines.
The elements highlighted in this paper could also be used in state-of-the-art methods like \cite{zhao2020training, wang2018removing}, which so far have only been applied to solve the problem for scalar confounding variables. For this, the confounding variable predicted in these methods would be the brain mask through a segmentation loss. 


\midlacknowledgments{This work was supported by the LABEX PRIMES (ANR-11-LABX-0063) of Université de Lyon, within the program ``Investissements d'Avenir'' operated by the French National Research Agency (ANR) and by the ``Projet Emergence'' APIDIFF, CNRS-INS2I. We acknowledge the ``Observatoire Français de la Sclérose en plaques'' (OFSEP) for providing the data collected with ANR-10-COHO-002. This work was performed using HPC resources from GENCI-IDRIS (AD011012544/AD011012589).
}


\bibliography{midl24_259}

\newpage
 \appendix

\section{Visual results on MS}


\begin{figure}[h]
    \centering
    \subfigure[IXI         \label{f:att_io}   ]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/ixiO_att.png} }      
      \subfigure[OFSEP     \label{f:att_o}   ]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/ofs_att.png} }     
    \subfigure[IXI   norm. \label{f:att_ioreg}]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/ixiregO_att.png} }
    \subfigure[OFSEP norm. \label{f:att_oreg}]{ \includegraphics[width=0.22\textwidth, trim=15mm 0mm 15mm 10mm, clip=true]{Figures/ofsreg_att.png} }     
    \caption{IXI vs OFSEP classification gradient attributions. From left to right: an IXI image, an OFSEP image, a shape-normalized IXI image, and a shape-normalized OFSEP image.
    Negative attributions are in blue and positive ones in red.}
        \label{f:attofs}
\end{figure}


\begin{figure}[h]
    \centering
    \subfigure[Original         \label{f:o_orig}   ]{ \includegraphics[width=0.23\textwidth, trim=145mm 0mm 145mm 20mm, clip=true]{Figures/ofs_orig.png} }      
      \subfigure[Cropped     \label{f:o_crop}   ]{ \includegraphics[width=0.23\textwidth,trim=145mm 0mm 145mm 20mm, clip=true]{Figures/ofs_crop.png} }     
    \subfigure[Registered \label{f:o_reg}]{ \includegraphics[width=0.23\textwidth, trim=145mm 0mm 145mm 20mm, clip=true]{Figures/ofs_reg.png} }
\\
    \subfigure[Original  norm.       \label{f:oreg_orig}   ]{ \includegraphics[width=0.23\textwidth, trim=145mm 0mm 145mm 20mm, clip=true]{Figures/ofsreg_orig.png} }      
      \subfigure[Cropped norm.    \label{f:oreg_crop}   ]{ \includegraphics[width=0.23\textwidth,trim=145mm 0mm 145mm 20mm, clip=true]{Figures/ofsreg_crop.png} }     
    \subfigure[Registered norm. \label{f:oreg_reg}]{ \includegraphics[width=0.23\textwidth, trim=145mm 0mm 145mm 20mm, clip=true]{Figures/ofsreg_reg.png} }
    \caption{Example of an OFSEP image with a random IXI brain mask (in red) and the corresponding modified images (cropped or registered). The first (resp. second) row is without (resp. with) shape normalization. In this case, the original and cropped images without shape normalization are classified as pathological, while registered images are classified as healthy.}
        \label{f:im_ofs}
\end{figure}

In Figures \ref{f:attofs} and \ref{f:im_ofs}, we display attribution maps and shape modifications as in Figures \ref{f:att} and \ref{f:im_brats}, but for multiple sclerosis. The influence of the brain shape is less visible in the attributions for multiple sclerosis than for tumors, which is consistent with the numerical results of Sections \ref{sec:bmaskclassify} and \ref{sec:bshapeisused}. Indeed, high attributions are spread all over the brain. Using the shape-normalized dataset changes the localization of the attributions: the decision seems less focused on the occipital lobe and the cerebellum, and more on the area around the ventricles.
The brain shape difference between the two datasets appears to be located at the back of the skull, which is in line with the attributions obtained without shape normalization. The shape normalization is as effective as on the tumor dataset, as the shapes of the two databases match better.

\section{Quantitative results on brain shape normalization}


\begin{table}[h]
    \centering
        \caption{Average brain volume (in voxels) with and without shape normalization.}
    \begin{tabular}{|c|r@{ $\pm$ }l|r@{ $\pm$ }l|}
    \hline
        Dataset & \multicolumn{2}{c|}{Without normalization} & \multicolumn{2}{c|}{With normalization}\\
        \hline
       IXI &  \hspace{0.5cm}226695 &           15393 & \hspace{0.4cm}233847 &           3816 \\
     OFSEP &                223919 &           15816 &               233553 &           4745 \\
     BraTS &                232094 & \phantom{0}5792 &               234053 & \phantom{0}527 \\
     HCP & \hspace{0.5cm}226818 &           14623 & \hspace{0.4cm}234016 &           4116 \\
      IBC & \hspace{0.5cm}234565 &           11071 & \hspace{0.4cm}235009 &           \phantom{0}624 \\
      kirby & \hspace{0.5cm}223192 &           10383 & \hspace{0.4cm}234477 &           \phantom{0}483 \\
     MPI & \hspace{0.5cm}219067 &           \phantom{0}3165 & \hspace{0.4cm}234044 &           \phantom{0}470 \\
    \hline
    \end{tabular}
    \label{t:bvol}
\end{table}

In Table \ref{t:bvol}, we compare the mean brain volume of each dataset with and without the shape normalization proposed in Section \ref{sec:regnorm}. With the normalization, the variability is much lower both between the datasets and within each dataset, as desired.
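
To make this reduction concrete, a rough worked example can be derived from the values of Table \ref{t:bvol} alone. Assuming that the $\pm$ terms are standard deviations, the within-dataset relative spread $\sigma/\mu$ for IXI goes from
\begin{equation*}
\frac{15393}{226695} \approx 6.8\% \quad \text{to} \quad \frac{3816}{233847} \approx 1.6\%,
\end{equation*}
and the between-dataset range of the mean volumes (largest minus smallest mean) goes from
\begin{equation*}
234565 - 219067 = 15498 \quad \text{to} \quad 235009 - 233553 = 1456 \ \text{voxels}.
\end{equation*}
These two summary quantities are illustrative choices made for this example and are not metrics used elsewhere in the paper.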

\section{Second step on the healthy databases classification}

\begin{figure}[h]
    \centering
    \subfigure[IXI vs HCP\label{f:ixihcp_acc}]{\includegraphics[width=0.49\textwidth,  trim=0mm 0mm 0mm 0mm, clip=true]{Figures/ixi_hcp_acc.png}}
     \subfigure[HCP vs IBC/kirby/MPI    \label{f:hcpfl_acc}]{\includegraphics[width=0.49\textwidth, trim=0mm 0mm 0mm 0mm, clip=true]{Figures/hcp_fl_acc.png}}
    \subfigure[IXI vs IBC/kirby/MPI    \label{f:ixifl_acc}]{\includegraphics[width=0.49\textwidth, trim=0mm 0mm 0mm 0mm, clip=true]{Figures/ixi_fl_acc.png}}
    \caption{True positive/negative rate (TPR/TNR) for three healthy vs healthy classifications.
    Bar plots are grouped depending on whether brain shape normalization and elastic data augmentation are used or not. Colors refer to whether original images (light blue), cropped images (middle blue) or registered images  (dark blue) are used at test time.}
    \label{f:class_im_H}
\end{figure}

In Figure \ref{f:class_im_H}, we present the classification results on the grayscale images for several healthy vs healthy database classifications. The results show that, for the IXI vs HCP and the HCP vs IBC/kirby/MPI classifications, the brain shape is a confounder used by the network to make its decision. Indeed, when a shape transform is applied at test time, the classification performance drops.
For the IXI vs HCP classification, the brain shape is no longer a confounder with either elastic data augmentation or the proposed shape normalization.
For the HCP vs IBC/kirby/MPI classification, our brain shape normalization eliminates the brain shape confounder more effectively than the data augmentation, with a 5-point accuracy gain for the cropped images.
For the IXI vs IBC/kirby/MPI classification, the brain shape does not seem to be part of the network decision as the accuracy falls only slightly when the shape transformations are applied.
Note however that even then, the shape normalization removes this slight accuracy decrease.
Even when the brain shape is not (or barely) used as a confounder, the normalization does not degrade the performance. Moreover, the brain shape could still be used to distinguish the databases in a different setup (as shown with the first step in Section \ref{sec:bmaskclassify}). The brain shape normalization therefore seems to be a good step to add to a preprocessing pipeline.

\end{document}
