
% This   may occur, where patients can have different pre-existing conditions and pathologies. Third, differences or artefacts in image acquisition protocols might result in images for the same patient having different characteristics.  


\subsection{Data}
We investigate to what extent data augmentation strategies can mitigate the effects of distribution shifts in MRI. For both cardiac cine MRI and bi-parametric prostate MRI, we perform two tests. First, to test the effect of real-world distribution shifts, we include two separate datasets per application, allowing us to consider differences within and between datasets. Second, because real-world datasets differ in many MR image characteristics at once, it is hard to quantify how much each characteristic contributes to a generalization gap. We therefore modify the test set using controlled MRI transformations, allowing us to attribute failures to specific MR image variations.

\paragraph{Cardiac cine MR}
We use the Automated Cardiac Diagnosis Challenge (ACDC) dataset~\cite{bernard_deep_2018}, which contains 150 cardiac cine MRI scans (100 training, 50 test) acquired at the University Hospital of Dijon, France (in-plane resolution 1.37--1.67\,mm, slice thickness 5--10\,mm).
% Patients are equally distributed over five categories: normal, myocardial infarction (MI), dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), and right ventricular abnormality (RV). 
As an external test set for measuring generalization performance, we include the M\&Ms~\cite{campello_multi-centre_2021} test set of 268 scans with different pathologies from multiple centers (in-plane resolution 0.85--1.45\,mm, slice thickness 10\,mm). In both ACDC and M\&Ms, manual annotations of the left ventricle (LV), myocardium (MYO), and right ventricle (RV) are provided for the end-diastolic (ED) and end-systolic (ES) frames.

% To measure the generalization gap, we use the models trained on ACDC and evaluate their performance on the M\&Ms~\cite{campello_multi-centre_2021} test set,  
% Inclusion in M\&Ms follows the same standard operating procedure as ACDC and has similar imaging characteristics. Moreover, the set includes data from multiple centers (four from Spain, one from Canada, and one from Germany). This allows us to investigate the generalization of models to differences between centers, which might manifest in patient characteristics, image characteristics due to different vendors, as well as imaging protocols. In both ACDC and M\&Ms, the task is to segment the left ventricular cavity, the myocardium, and the right ventricular cavity.
% The temporal resolution is of 28-40 frames per cardiac cycle.
% It includes patients with various cardiac conditions, including healthy controls, as well as individuals with  with 30 for each pathology and healthy control.
% The MRI scans are acquired using CINE MRI with 3D short-axis sequences at 
% for investigating the performance of deep learning methods under various image variations that might occur at the acquisition stage.
% The ACDC dataset consists of 
% Finally, pathologies and pre-existing conditions add another layer of complexity. Conditions like cardiac hypertrophy, myocardial infarction, or ventricular dilation create abnormal morphologies that deviate from healthy anatomical templates. Similarly, prostate conditions such as benign prostatic hyperplasia (BPH) or prostate cancer significantly alter gland shape and tissue texture. 
% Variations are particularly challenging for segmentation models trained primarily on datasets with limited diversity, for instance, pathology, imaging protocol for high-resolution scans with thinner slices~\cite{huang_impact_2021}, or different scanner vendors. These are differences that are hard to replicate using common augmentation techniques.
% The task is the same, segmentation of left and right ventricle and myocardium.
% Patient demographics play a critical role in introducing variability as many anatomical characteristics can be attributed to regions (for instance, ethnicity) and regional norms (for instance, diet).
% Factors such as age, sex, and body habitus influence the size, shape, and signal properties of anatomical structures. 
% At hospitals and clinics, many different patient preparation protocols or different scanners might be used. 
% Acquisition protocols introduce additional variability, as parameters such as slice thickness, image orientation, echo time, and repetition time are tailored to specific diagnostic needs. 
% Due to M\&Ms dataset acquiring scans from multiple centres, multiple machines and multiple pathologies, it provides us with a tool to measure the generalisation gap of deep neural networks that have been trained with data with low variations.

\paragraph{Prostate MRI}
We include prostate bi-parametric MRI (bpMRI) scans from the Prostate158 (P158) dataset~\cite{adams_prostate158_2022}, which has 139 scans for training and 19 scans for testing.
% This data set contains patients with prostate cancer (PCa) lesions. 
The diffusion-weighted imaging (DWI) scans were acquired at b-values ranging from 50 to 1000\,$\mathrm{s/mm^2}$, with an additional high b-value of 1400\,$\mathrm{s/mm^2}$. In addition, as an external test set to measure generalization, we use the ProstateX (PX) dataset~\cite{armato_prostatex_2018}, which includes 141 test scans whose DWI scans were acquired with three b-values (50, 400, and 800\,$\mathrm{s/mm^2}$) and which provides a computed apparent diffusion coefficient (ADC) map. Both datasets provide T2-weighted (T2w) scans at an in-plane resolution of 1.45--1.5\,mm, ADC maps at 0.45--0.5\,mm, and slice thicknesses of 3--4\,mm. Segmentation masks of the transition zone (TZ) and the peripheral zone (PZ) of the prostate are available for all images. The test masks for PX are taken from~\cite{xu_development_2023}. Both T2w and ADC are used for model training and evaluation.

% , which only has healthy patients, and evaluate their performance on the test set of P158.

% We which provides .
% to investigate

% The dataset provides 
% P158 showcases a high level diversity as it contains scans with prostate cancer (PCa) lesions. Therefore, to measure the generalisation gap, 
% Scans for both datasets are acquired at different centres and cases from MSD consist of only healthy patients while cases from P158 consists of scans from both healthy patients and patients with prostate cancer (PCa) lesions.
% These differences help us study the effects of variation on deep neural networks trained with only healthy patients when presented with scans with PCa lesions.
% As these scans are also acquired at different centres, we expect more variability due to differences in protocols and demographics.
% For instance differences in contrast-agent administration protocols further contribute to inconsistencies, particularly in imaging dynamic tissues like the heart or tumours in prostate scans.



% , as evidenced by findings in datasets like ACDC and M\&Ms ~\cite{CITE}.

%In recent years there have been many large-scale datasets that have been developed to develop better and more performative automated segmentation of the prostate gland, temporal zone, and also cancerous lesions~\cite{CITE}. 
% When datasets are collated over different centres, differences in MRI acquisition settings across studies further exacerbate this variability, creating a significant hurdle for achieving robust generalization. These factors underline the necessity for augmentation strategies that can accommodate a wider range of real-world variations.

% \subsection{Demographic Differences}

\subsection{Modelling Image Variation}
\label{sec:corruptions}
\begin{figure}[!tb]
    \centering
    \includegraphics[width=.95\linewidth]{figures/example_images/example_image_grid.pdf}
    \caption{Corruptions (severity level 3) in cardiac cine MRI images and the T2w channel of prostate bpMRI images as a result of our image variation model.}
    % These variations serve as a compelling example of challenges posed by variations in imaging for medical data analysis.}
    \label{fig:example_variations}
\end{figure}%
% In medical imaging, segmentation models must operate reliably across diverse real-world scenarios, where natural variations in data are inevitable. These variations arise from differences in patient anatomy, scanner types, imaging protocols, and environmental factors during acquisition. However, relying solely on highly curated datasets may not fully capture the breadth and severity of conditions that models might encounter in practice, as these datasets are intentionally `clean' data that represent the best possible scans in most cases. 

% To address this, w

% We simulate image variation in a controlled manner by applying a range of transformations to images.
We define five distinct severity levels for each transformation. Severity levels capture various magnitudes of distribution shift, from mild to severe. As quantitative studies of the range of severity of variations in medical imaging data are lacking, we follow the approach of ROOD-MRI~\cite{boone_rood-mri_2023} and determine the generation parameters based on input from experienced MRI technicians. The parameters that define the severity levels are detailed in our code and may need to be adapted for other datasets.
% Such transformations enable us to create reproducible test conditions, allowing for . %   to factors like noise, motion artifacts, intensity variations, and other acquisition-related corruptions.
In total, we apply 14 transformations to each image: elastic deformation, isotropic downsampling, anisotropic downsampling, bias field amplification, contrast compression, contrast expansion, ghosting, random motion, Rician noise addition, smoothing, rotation, scaling, spike noise artifacts (radio-frequency noise), and k-space subsampling, each at five severity levels (1: mild, 5: severe); see Fig.~\ref{fig:example_variations} for examples.
These transformations are applied only to the test sets of ACDC and P158 and are never seen during training. This allows us to systematically study the fragility of the models to unseen MR image variations.
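As an illustration of how one such transformation can be parameterized by severity, Rician noise is commonly simulated by adding independent Gaussian noise to two channels of a magnitude image \(x\) and taking the magnitude again. This is only a sketch under assumed notation: the symbols \(\varepsilon_1\), \(\varepsilon_2\), \(\sigma_s\), and the pixel index \(m\) are introduced here for illustration, and the actual parameter values per severity level are those in the code, not anything stated here.
\[
x^{\text{Rician}}(m) = \sqrt{\bigl(x(m) + \varepsilon_1(m)\bigr)^2 + \varepsilon_2(m)^2}, \qquad \varepsilon_1(m),\, \varepsilon_2(m) \sim \mathcal{N}(0, \sigma_s^2),
\]
where a larger noise standard deviation \(\sigma_s\) would correspond to a higher severity level \(s \in \{1, \ldots, 5\}\).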

% . ROOD-MRI defines these artifacts with a 5-step increasing severity, where higher severity means a prominent effect of these artefacts. 

% While following methodology outlined in ROOD-MRI, we additionally include sto test robustness against frequency distortions as part of possible image variations. 
% Fig.~\ref{fig:example_variations} shows examples of a subset of these corruptions applied to a test sample in the ACDC and P158 dataset. 

% These systematic evaluations are essential for ensuring that segmentation models perform robustly and generalize well in diverse clinical settings, and further dictate policy of augmentations that are effective in combating domain generalisation.

% Hospital and machine differences are a major source of variability, as institutions employ MRI scanners from different manufacturers (e.g., Siemens, GE, Philips), each with unique hardware and software characteristics. These differences extend to magnetic field strengths (e.g., 1.5T, 3T), which impact signal-to-noise ratios and tissue contrast ~\cite{CITE}. Furthermore, calibration settings and maintenance schedules vary between machines, leading to subtle but impactful differences in imaging outputs~\cite{CITE}.
% We aim to evaluate the robustness of segmentation models to a wide array of transformations that simulate real-world conditions and challenges encountered in medical imaging. We follow the methodology as described in ROOD-MRI~\cite{CITE} and make further additions to the transformations that would be relevant to our discussion. The transformations include both conventional corruptions that appear during MRI acquisition and other perturbations that influence image properties or acquisition characteristics. 

% \begin{multicols}{2}
% \begin{enumerate}
%     \item \textbf{Bias Field}: Introduces low-frequency intensity inhomogeneities, mimicking field non-uniformities in MRI scanners.  
%     \item \textbf{Contrast Compression}: Reduces the dynamic range of intensities, making tissue boundaries less distinguishable.  
%     \item \textbf{Contrast Expansion}: Enhances the intensity range, exaggerating differences between tissues.  
%     \item \textbf{Elastic Deformation}: Simulates geometric distortions, akin to organ deformation during scanning or due to physiological motion.  
%     \item \textbf{Ghosting}: Mimics aliasing artifacts caused by patient movement or hardware issues.
%     \item \textbf{Rigid Random Motion}: Introduces random spatial displacements to simulate patient movement during the scan using k-space perturbation.
%     \item \textbf{Rician Noise}: Adds Rician-distributed noise to the magnitude image, common in low signal-to-noise scenarios or when a high resolution scan is taken
%     \item \textbf{Smoothing}: Mimics taking a centre scan of the image.
%     \item \textbf{Rotation}: Rotates the image by a random angle, testing the model’s invariance to orientation changes.
%     \item \textbf{Scale}: Resizes the image to simulate variations in scanner settings or differences in organ sizes.
%     \item \textbf{Spike Noise}: Adds radio frequency noise to the samples.
%     \item \textbf{K Space Subsampling}: Simulates undersampled k-space data, resulting in lower-resolution reconstructions.  
% \end{enumerate}
% \end{multicols}

% For each transformation, the model is tested across a range of severity levels to capture its robustness profile. 

\subsection{Augmentation Strategies}
We train models with three different augmentation strategies within the nnU-Net framework~\cite{isensee_nnu-net_2021}. In all experiments, we keep the pre-processing and post-processing fixed to the nnU-Net defaults.
% , which uses the convolutional U-Net~\cite{ronneberger_u-net_2015} as the backbone architecture.

\paragraph{Base augmentations} By default, nnU-Net employs eight augmentation strategies, namely rotation, scaling, Gaussian noise injection, Gaussian blurring, brightness and contrast adjustments, simulation of low-resolution imaging, gamma correction, and mirroring.


% In this work, we compare individually and in combination three augmentation strategies: standard nnU-Net augmentations, MixUp, and Auxiliary Fourier Augmentation (AFA).
% \paragraph{nnU-Net augmentations} A standard set of nnU-Net augmentations include  We do not modify this set when we use these augmentations in combination with other augmentations. Note that these augmentations significantly overlap with the set of image variations during MR acquisition, allowing us to measure the potency of simply replicating the variations as part of the augmentations in improving performance.

\paragraph{MixUp} \label{sec:MixUp}
% ~\cite{zhang_MixUp_2018} 
In this strategy, new samples are generated through linear interpolation of pairs of training samples. 
% Here a sample consists of input $x_i$ and label $y_i$.
Formally, given two samples \((x_i, y_i)\), \((x_j, y_j)\) and \(\lambda \in [0, 1]\) drawn from a Beta distribution, MixUp creates a new synthetic sample as: 
\[
x_{\text{mix}} = \lambda x_i + (1 - \lambda) x_j, \quad y_{\text{mix}} = \lambda y_i + (1 - \lambda) y_j.
\]
% This technique encourages models to learn smoother decision boundaries by exposing them to interpolated data points~\cite{zhang_MixUp_2018}. 
In the original MixUp formulation for classification tasks, images and one-hot encoded labels are both linearly interpolated. We use the same strategy for segmentation: instead of interpolating between two image-level labels, we interpolate element-wise between two one-hot encoded segmentation masks. The loss is then computed using the resulting probability masks as ground truth.
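A small worked example may help to make the interpolated masks concrete. It is a sketch under assumptions that are not taken from the experiments: the class order (background, LV, MYO, RV), the value \(\lambda = 0.7\), the position index \(m\), and the predicted class probabilities \(p_c(m)\) are all chosen purely for illustration. Suppose position \(m\) is labeled LV in \(y_i\) and MYO in \(y_j\). Then
\[
y_{\text{mix}}(m) = 0.7 \cdot (0, 1, 0, 0) + 0.3 \cdot (0, 0, 1, 0) = (0,\; 0.7,\; 0.3,\; 0).
\]
If the loss contains a cross-entropy term, its contribution at this position with the soft target would be
\[
-\sum_{c} y_{\text{mix},c}(m) \log p_c(m) = -0.7 \log p_{\text{LV}}(m) - 0.3 \log p_{\text{MYO}}(m),
\]
so the model is rewarded for distributing probability mass over both classes in proportion to \(\lambda\). Positions where both masks agree keep their one-hot label regardless of \(\lambda\).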
% We adapt the original MixUp setup, in which $y_i$ is a one-hot encoded sample label, to segmentation by linearly interpolating one-hot encoded segmentation masks.
% While previous works have shown MixUp improving classification and segmentation performance on test datasets, we expect the regularisation effect of MixUp~\cite{zhang_how_2020} would improve out-of-distribution generalization in medical image analysis as well.
% The interpolated images (\(x_{\text{mix}}\)) provide a continuous spectrum of anatomical and imaging features, while the interpolated masks (\(y_{\text{mix}}\)) combine label probabilities from the two original masks. 

% Our findings show that MixUp improves model performance on segmentation tasks on MRI for difficult images, as a simple yet effective augmentation strategy which requires little configuration.

\paragraph{Auxiliary Fourier Augmentation}\label{sec:afa} (AFA)
% ~\cite{vaish_fourier-basis_2024} 
augments images in the frequency domain under the hypothesis that visual augmentation techniques are unable to cover the vulnerability of neural networks to perturbations in the frequency domain~\cite{vaish_fourier-basis_2024}. AFA samples frequency basis functions and adds them to the training samples, leaving the labels unchanged. Formally, let \(\mathcal{F}\) denote the real Fourier transform operator. For a training sample \((x_i, y_i)\), the \(n\)-dimensional Fourier transform of \(x_i\) is given by \(X_i = \mathcal{F}(x_i)\). For a fixed mean \(\mu\), the Fourier spectrum is perturbed by a real value \(\alpha \sim \text{Exp}(\mu)\), drawn from an exponential distribution with mean \(\mu\), at a randomly chosen frequency coordinate \((k_1, k_2, \ldots)\) in the Fourier domain, modifying \(X_i\) as:
\[
X_i^{\text{aug}}(k_1, k_2, \ldots) = X_i(k_1, k_2, \ldots) + \alpha.
\]
The augmented image in the spatial domain, \(x_i^{\text{aug}}\), is then obtained by applying the inverse Fourier transform: \(x_i^{\text{aug}} = \mathcal{F}^{-1}(X_i^{\text{aug}})\). Model training jointly optimizes the loss on an AFA-augmented image and on a non-AFA-augmented image.
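The spatial-domain effect of this perturbation can be made explicit. The following is a sketch under assumed notation: the spectrum \(\delta_k\) (one at the chosen coordinate and zero elsewhere), the pixel index \(m = (m_1, m_2, \ldots)\), the image size \(N_d\) along dimension \(d\), and the constant \(c\) are introduced here for illustration. By linearity of the inverse transform,
\[
x_i^{\text{aug}} = \mathcal{F}^{-1}\bigl(X_i + \alpha\, \delta_k\bigr) = x_i + \alpha\, \mathcal{F}^{-1}(\delta_k),
\]
and for a real Fourier transform the added term is a planar cosine wave,
\[
x_i^{\text{aug}}(m) = x_i(m) + c\, \alpha \cos\Bigl(2\pi \sum_{d} \frac{k_d\, m_d}{N_d}\Bigr),
\]
where \(c > 0\) is a normalization constant that depends on the transform convention. The chosen coordinate therefore sets the orientation and spatial frequency of the added pattern, while \(\alpha\) sets its amplitude.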

% This augmentation strategy was introduced for a general computer vision task, however, t

% Our findings demonstrate AFA's ability to improve model performance under challenging conditions, including unseen artefacts and dataset shifts.
% \subsection{Performance under Corruptions}

\subsection{Quantitative Evaluation}
We segment all structures separately, namely LV, MYO, and RV in cardiac cine MRI, and TZ and PZ in prostate MRI. Results are reported as the average Dice similarity coefficient (DSC) and $95^{\text{th}}$ percentile Hausdorff distance (HD95), as implemented in MONAI~\cite{consortium_monai_2024}, over all structures and, for cine MRI, over both frames (ED, ES).
For all settings, we perform 5-fold cross-validation. All predictions are made using an ensemble of five models, which is the default and recommended way to use nnU-Net. To test for statistical significance, we use a paired t-test at $p < 0.05$ on the individual metrics calculated for each sample during testing before ensemble averaging. Structure-wise results are shown in Appendix~\ref{app:struct}.
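For reference, the metrics and the test statistic are commonly defined as follows. This is a sketch in assumed notation, not a description of the MONAI internals: \(P\) and \(G\) denote the predicted and ground-truth voxel sets of a structure, \(\partial P\) and \(\partial G\) their surfaces, \(d(a, S) = \min_{b \in S} \|a - b\|_2\), and \(\operatorname{P}_{95}\) the 95th percentile.
\[
\text{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad
\text{HD95}(P, G) = \max\Bigl\{ \operatorname*{P_{95}}_{p \in \partial P} d(p, \partial G),\; \operatorname*{P_{95}}_{g \in \partial G} d(g, \partial P) \Bigr\}.
\]
For the paired t-test, let \(d_k\) be the difference in a metric between two strategies on test sample \(k = 1, \ldots, K\), with sample mean \(\bar{d}\) and sample standard deviation \(s_d\). The statistic is
\[
t = \frac{\bar{d}}{s_d / \sqrt{K}},
\]
which is compared against a Student's t-distribution with \(K - 1\) degrees of freedom.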

% To this extent, we also measure the generalisability of our model using k-Variance gradient-normalised margins~\cite{chuang_measuring_2021} that measures these two key facets.

% F nnU-Net uses . We use nnU-Net for all experiments to be consistent with our data pre- and post-processing technique, to allow a fair comparison in between models.
% For our experiments, we utilize two state-of-the-art model architectures: MedNeXt~\cite{roy_mednext_2023} and the nnU-Net framework~\cite{isensee_nnu-net_2021}. MedNeXt is a convolutional neural network designed specifically for medical image analysis, leveraging an efficient architecture to handle the complexities of high-resolution medical data. nnU-Net framework also currently uses the MedNeXt backbone, and is known for its dynamic configuration of hyperparameters and preprocessing tailored to different datasets.
% We choose to perform our evaluation both with using the nnU-Net base options of data preparation, augmentation and post processing and without to truly gauge the benefits of the considered augmentation strategies.
% These architectures serve as strong baselines, enabling us to effectively evaluate the impact of MixUp and Auxiliary Fourier Augmentation on segmentation performance, robustness, and generalization.
% \subsection{Quantitative Evaluation}
% We consult the work Metrics Reloaded~\cite{maier-hein_metrics_2024} as a decision guide which recommends we use the

% The performance is primarily measured using the Dice Similarity Coefficient (DSC), which quantifies the overlap between predicted and ground truth segmentations.

% Importantly, we categorize images into diagnostically relevant and non-relevant subsets based on distortion severity, reporting performance separately for each category. This ensures that the analysis reflects the model's utility in clinically applicable scenarios. This holistic approach enables a comprehensive understanding of model performance under diverse and clinically relevant transformations.