\subsection{Synthetically Corrupted Images}
\label{sec:robimvar}

\begin{table}[!tb]
\centering\caption{DSC and HD95 on the original and transformed test sets of ACDC and P158, using either no augmentations or a combination of base, MixUp, and AFA augmentations. \blue{Blue} and \red{red} indicate statistically significant ($p < 0.05$) \blue{improved} / \red{reduced} metrics between models without and with MixUp or AFA, using a paired $t$-test. \textbf{Bold / \blue{Bold}}-faced numbers indicate the best result for each column.}
\input{tables/avg_dataset_results}
\label{tab:results_acquisition}
\end{table}
% We present the results in Table~\ref{tab:results_acquisition}.

We train nnU-Net models with different combinations of base augmentations, MixUp, and AFA, for both the ACDC cardiac cine MR dataset and the P158 prostate dataset. For each dataset, we evaluate the models on the original test set and on the test set with the transformations described in Sec.~\ref{sec:corruptions}. Note that we do not apply any of these corruptions to the training sets. Tab.~\ref{tab:results_acquisition} reports DSC and HD95 values for this experiment. Fig.~\ref{fig:trend_performance} shows the relation between the severity of individual transformations and the DSC values obtained for ten models.
% applied on the test set and the performance of the models trained with different combinations of augmentation strategies for both ACDC and P158. 



\paragraph{Cardiac cine MRI} For cardiac cine MRI, when not using any data augmentation, there is a large performance gap between the original ACDC test set and the transformed test set, i.e., DSC 0.891 vs. 0.755, indicating poor generalization to out-of-distribution data. We find that adding either MixUp or AFA to this model improves performance on the transformed test set, to DSC 0.760 and 0.801, respectively. Moreover, the combination of both augmentation strategies improves performance further to DSC 0.804. The gain on the transformed test set exceeds the performance drop on the original test set.
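In absolute terms, and using only the DSC values quoted above (the symbols $\Delta_{\mathrm{gap}}$, $\Delta_{\mathrm{MixUp}}$, $\Delta_{\mathrm{AFA}}$, and $\Delta_{\mathrm{MixUp+AFA}}$ are introduced here purely for illustration), the drop without augmentation and the recoveries on the transformed test set are
\[
\Delta_{\mathrm{gap}} = 0.891 - 0.755 = 0.136, \qquad
\Delta_{\mathrm{MixUp}} = 0.760 - 0.755 = 0.005,
\]
\[
\Delta_{\mathrm{AFA}} = 0.801 - 0.755 = 0.046, \qquad
\Delta_{\mathrm{MixUp+AFA}} = 0.804 - 0.755 = 0.049,
\]
so that MixUp and AFA together recover roughly a third of the gap ($0.049 / 0.136 \approx 0.36$).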

A similar pattern unfolds when we combine MixUp and AFA with base augmentations. Here, we see a performance increase in all settings for both the original and transformed data. Notably, base augmentations lead to a DSC of 0.801 on the transformed test set, compared to DSC 0.755 when no augmentation is used. This indicates that these augmentations are able to improve performance on some of the out-of-distribution samples. However, we find that MixUp and AFA can lead to significant $(p < 0.05)$ performance gains on top of these augmentations, up to DSC 0.862 and HD95 7.02 when both are used. Moreover, when MixUp and AFA are used in combination with base augmentations, the performance drop on the original test set is smaller and not significant. 
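As a reminder of what the significance markers encode, the paired $t$-test compares two models through per-sample differences. The following is a sketch only, assuming that the pairing is over the $n$ test cases and writing $d_i$ for the difference in a metric (e.g., DSC) between the two models on case $i$; the symbols $d_i$, $\bar{d}$, and $s_d$ are introduced here for illustration:
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad
s_d^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2 .
\]
Under these assumptions, $t$ is compared against a Student $t$ distribution with $n-1$ degrees of freedom, and a difference is marked as significant when $p < 0.05$.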

The results in Fig.~\ref{fig:trend_performance} indicate that adding MixUp and AFA improves robustness to \textit{all} imaging variations. This includes common MRI artifacts such as bias fields, which might be corrected using existing techniques, but also corruptions that cannot easily be corrected with pre-processing techniques, such as k-space subsampling, ghosting, spike noise, and Rician noise. We also find that the performance increase is larger at higher severity levels, again indicating a substantial improvement in robustness. % Furthermore, also 

\paragraph{Prostate MRI} Our findings on the prostate MRI dataset largely match those on the cardiac cine MRI dataset. For the model without any augmentations, there is a performance gap between the original (DSC 0.789) and transformed (DSC 0.705) test sets. Using AFA, either stand-alone or in combination with MixUp, narrows this gap. However, we also find that using \textit{only} MixUp has a detrimental effect on model performance. As for cardiac cine MRI, MixUp and AFA reduce performance on the original test set when used without base augmentations; however, this reduction is only significant when both MixUp and AFA are used. In contrast to cardiac cine MRI, adding MixUp and AFA to base augmentations leads to a significant performance increase not only on the transformed test set but also on the original test set. Results for prostate MRI in Fig.~\ref{fig:trend_performance} show trends similar to those for cardiac cine MRI, supporting the general nature of our findings and the added value of MixUp and AFA over base augmentations.

\begin{figure}[!tb]
    \centering
    \input{pgfplots/corruption_trend}
    \caption{Trend of DSC per severity for test sets corrupted with bias field, ghosting, k-space subsampling, Rician noise and spike noise. We notice that MixUp and AFA are effective in mitigating the challenges posed by complex image corruptions.}
    \label{fig:trend_performance}
\end{figure}


% on the original and the transformed data set in all cases, a significant improvement over only using base augmentations.
% In contrast to what we found in cardiac cine MRI, adding MixUp and AFA does not appear to have a detrimental effect on performance on the original test set.





% We analyse the impact of combinations of base augmentations, MixUp and AFA, on feature representations. 

% substantially. 


% \subsection{Robustness to Individual Variations}

% We further observe that the improvement in performance for some transformations is quite large across each severity for both ACDC and P158, as shown in 


% the standard nnU-Net model which uses Bias Field as an augmentation greatly misclassifies the structures while this is reduced with the combination of MixUp alone and in combination with AFA.

% A similar trend was seen for the P158 dataset, with the exception of MixUp without nnU-Net augmentation having a lower mean average dice than not using MixUp for the transformed test set. However this decline was not statistically significant.

% Notably, when nnU-Net augmentations were incorporated during training, significant improvements were observed on transformed test sets and no significant decline in the original test set for both datasets.

% the ACDC dataset achieved a DSC of 0.891 on the original test set and 0.755 on the transformed test set, and for P158 a DSC of 0.789 on the original test and 0.705 of the transformed test set. Augmentations using AFA alone led to significant improvements in DSC on the transformed test set 0.801 and 0.724, respectively, for ACDC and P158, and this is not significantly different from when we have only nnU-Net augmentations enabled (row 5), showing the strength of these augmentations to significantly making the models robust to out-of-distribution samples. 

% MixUp alone or in combination with AFA led to improvements on the DSC, compared to the baseline (0.755). However, it is to be noted that for the ACDC dataset, using AFA alone or in combination with MixUp leads to significant decline on the original test set when not used in conjuntion with nnU-Net augmentations.


% Among these, the highest DSC was observed with both MixUp and AFA (0.804), indicating that the combination of these augmentations can effectively enhance robustness against transformed data. 



% For the P158 dataset, the baseline nnU-Net achieved a DSC of 0.789 on the original test set and 0.705 on the transformed test set. Augmentations using MixUp alone or in combination with AFA resulted in substantial improvements on the transformed test set, with DSCs of 0.724, 0.758, and 0.770 (rows 6, 7, and 8, respectively) compared to the baseline (0.705). When nnU-Net augmentations were included during training, consistent gains were observed across both original and transformed test sets, with the highest DSC on transformed data achieved using MixUp and AFA (0.770). These results demonstrate the benefit of combining nnU-Net augmentations with additional strategies like MixUp and AFA, particularly in improving robustness under data transformations. Similarly, improvements in HD95 metrics were noted, with the lowest value of 6.33 achieved using both MixUp and AFA, further highlighting the effectiveness of these augmentations in handling challenging conditions.
% We first evaluate the robustness of models trained with only the training set of the challenges to the presence of acquisition artefacts in MR data and the added value of a combination of the standard nnU-Net augmentations and MixUp and AFA augmentation strategies. For cardiac cine MR the results in Table~\ref{tab:results_acquisition} indicate that a model trained with MixUp (0.760), AFA (0.801), or both (0.804) performs significantly better than a model trained without either (0.755). For models trained without standard nnU-Net augmentations there is performance decline observed for the original test set, this is often the case due to the added bias of augmentation strategies.
% Moreover, while we find that the nnU-Net with its base options of augmentations add a substantial improvement in performance compared to the model without augmentation (0.801).
% Introducing MixUp or AFA alone improves robustness, but the combination of MixUp and AFA achieves the highest performance on corrupted data, increasing the Dice score from 0.801 to 0.862—a substantial improvement of 6.1 percentage points over the baseline.
% Notably, the augmentations achieve this robustness without sacrificing performance on the original test set, where Dice scores remain nearly identical across configurations.
% The results for the P158 dataset, similarly, in Table~\ref{tab:results_acquisition} indicate that models trained with MixUp (0.758) and AFA (0.760) performs substantially better than model trained without (0.730). When the base augmentations of nnU-Net, AFA and MixUp are combined we get the best performance on the transfer test set (0.770). For the P158 dataset, however, it appears that MixUp results in a drop in performance on the transformed test set when there are no nnU-Net augmentations enabled (0.696 vs 0.705). This can be explained by the fact that MixUp does not introduce any image variations when it combines the input, therefore the variance of the augmented dataset is small. When the images have been augmented by the nnU-Net augmentations, we gain a further performance on the transfer test set. We again notice no substantial changes in the performance of the original test set, in fact AFA results in a slight improvement (0.832 vs 0.825). 


% Use of MixUp performs well against, artefacts like Ghosting and AFA performs substantially well against k-space related artefacts (KSpaceSubsampling, RicianNoise, SpikeNoise) and we see that a combination of the two predominantly outperforms the rest at each severity.


% \begin{figure}[!tb]
%     \centering
%     \includegraphics[width=\linewidth]{figures/corruption_trends/chosen_corruption_trends_acdc_medg.png}
%     \caption{Trend of performance per severity for the transformed ACDC test set.}
%     \label{fig:trend_performance_acdc}
% \end{figure}

% \begin{figure}[!tb]
%     \centering
%     \includegraphics[width=\linewidth]{figures/corruption_trends/chosen_corruption_trends_prostate158_medg.png}
%     \caption{Trend of performance per severity for the transformed P158 test set.}
%     \label{fig:trend_performance_p158}
% \end{figure}
% ability to generalize across different populations and acquisition conditions using the ACDC and M\&Ms datasets.  results obtained on cardiac cine MR segmentation in ACDC using a MedNeXt-L model. These results show that MixUp and AFA augmentations significantly enhance the robustness on transformed datasets while maintaining competitive performance on the original test set. The baseline model without augmentations suffers substantial performance degradation when evaluated on corrupted images (Dice score drops from 0.913 to 0.680). This drop underscores the sensitivity of segmentation models to image transformations, which can mimic real-world variabilities such as scanner noise, motion artifacts, and acquisition inconsistencies.


% ACDC primarily focuses on patients with cardiac conditions acquired in a controlled clinical environment, while M\&Ms spans multiple institutions and vendors, incorporating diverse populations with various pathologies and imaging conditions.

\subsection{Real-World Distribution Shifts}
In our previous experiments, we analysed the robustness of models trained with various augmentations on out-of-distribution test sets where the distribution shift was controlled. We now consider generalization between datasets with real-world distribution shifts.
% originating from similar yet different distributions.
These shifts may result from differences in demographics, protocols, acquisition parameters such as resolution or b-values in prostate bi-parametric MR, scanner vendors, etc. As in our previous experiment, we train with combinations of base, MixUp, and AFA augmentations, but omit models trained without base augmentations. Tab.~\ref{tab:results_ds} lists results for cardiac cine MRI and prostate MRI.

\paragraph{Cardiac cine MRI} We train segmentation models on the ACDC dataset and use M\&Ms as a test set. We find that a model trained on ACDC with only base augmentations experiences a performance drop compared to a model trained on M\&Ms with only base augmentations, with DSC values of 0.870 and 0.882, respectively. However, when MixUp, AFA, or both are added, we find consistent performance improvements across all augmentation combinations. Furthermore, models that combine base augmentations with MixUp, or with both MixUp and AFA, approach the DSC and HD95 of the model that was trained on M\&Ms itself, nearly closing the generalization gap.

\paragraph{Prostate MRI} Segmentation of prostate bpMRI presents a more severe domain-shift challenge. We train the segmentation model on P158 and use the PX dataset as the test set. As noted earlier, the large variability in prostate glands poses a difficult challenge and substantially impacts model performance. Despite this challenge, a model trained with base augmentations and MixUp (DSC 0.737, HD95 6.60) significantly improves over a model using only the base augmentations (DSC 0.705, HD95 7.87), indicating improved generalization in this setting with large anatomical variance. Models that additionally use AFA also show significantly improved performance.

\begin{table}[!tb]
    \centering
    \caption{DSC and HD95 performance under distribution shift for cardiac cine MR, testing on M\&Ms, and prostate bpMRI, testing on PX, with various data augmentation strategies. \blue{Blue} numbers denote significantly \blue{improved} metrics for models using MixUp or AFA compared to the model using only base augmentations $(p < 0.05)$.
    % Metrics for the best model trained on the test dataset from $\dagger$~\cite{campello_multi-centre_2021} and $+$~\cite{xu_development_2023}.
    }
    % $\dagger$ The best performing model from the original M\&Ms challenge.}
    \label{tab:results_ds}
    \input{tables/avg_metrics_ds}
\end{table}


\subsection{Model Interpretation}
To further analyse why the proposed augmentations outperform the base augmentation in MRI,
%  in-depth justification for why they outperform standard augmentations in MRI.
% To investigate whether our empirical findings reflect characteristics of the trained models under data augmentation, 
we quantify the separability and compactness of their learned features using k-variance gradient-normalized margins (kVGM)~\cite{chuang_measuring_2021}. 
% It is well motivated that better seperation and compactness are linked to better generalisation}. 
A higher value for this metric indicates that the model has learned more separable and compact clusters of representations, which in turn is linked to better generalisability.
% , while negative values mean that the model misclassifies.
Fig.~\ref{fig:emb_vis} visualizes the position of voxels from the transformed test set in the feature space of ten of our trained models (dimensionality reduced via PCA), along with the kVGM of each model. These plots show that the absence of augmentation leads to poor feature separation, while using only base augmentations leads to better-clustered features that are nonetheless not easily separable. Adding AFA alone improves separability and adding MixUp alone enhances compactness; when combined, they appear to promote both compactness and separability. In Appendix~\ref{app:more_pca} we show that this behaviour is consistent across many runs.
Ranked by increasing kVGM, and hence by expected generalisability, the models are ordered as follows: no augmentations, base augmentations, base with AFA, base with MixUp, and finally base with both AFA and MixUp.
This supports our finding that these augmentations enrich feature representations, leading to enhanced out-of-distribution generalization.

% Regularisation of the model weights is another key factor that effects model generalisability, where weights with lower norm are preferred as they are less likely to overfit to noise. To this extent, we consider the regularisation effect of these augmentations on the weights of the convolutional kernel of the U-Net backbone.


\begin{figure}[!tb]
\centering
\includegraphics[width=\textwidth]{figures/esoteric/pca_plot_minimised.pdf}
% \input{figures/esoteric/pca_acdc_avg_transformed}

% \input{figures/esoteric/pca_p158_avg_transformed}
\caption{PCA projection of the final features learned by nnU-Net models trained with different augmentation techniques, for samples from the transformed test sets (top: ACDC, bottom: P158), with the corresponding kVGM metric.}
\label{fig:emb_vis}
\end{figure}


% For both datasets, improvements in the HD95 metric generally followed similar trends to those observed in DSC.
% \paragraph{Cardiac Cine MR}
% In Tab.~\ref{tab:results_ds}, we evaluate the generalization performance of models trained on the ACDC dataset and tested on the M\&Ms dataset. 
% % Cross-dataset generalization is a critical metric, as it reflects the model's ability to perform under distribution shift caused by different demographics, protocol, and scanner vendors.
% While the baseline model trained on ACDC achieves a Total Dice score of 0.872 on M\&Ms, augmentations such as MixUp and Auxiliary Fourier Augmentation (AFA) significantly improve generalization, with both MixUp alone and the combined MixUp + AFA approach achieving a Total Dice score of 0.880. These results are not statistically significantly lower than the best model trained on M\&Ms dataset, which achieves a score of 0.883. 
% % This small difference indicates that the use of augmentations successfully mitigates the challenges of domain adaptation, enabling models trained on one dataset to perform almost as well as models trained directly on the target dataset.
% \begin{table}[!tb]
% \centering
% \caption{Performance comparison of models trained on ACDC and M\&Ms datasets. The evaluation is conducted on the M\&Ms dataset, with results stratified into End-Diastole (ED), End-Systole (ES), and Total Dice scores.}
% % \resizebox{\textwidth}{!}{%
% \small
% \begin{tabular}{@{}ccccc|c@{}}
% \toprule
% & \multicolumn{4}{c|}{{\makecell{Trained on ACDC}}} & {\makecell{Trained on M\&Ms}} \\ \midrule
% & {nnU-Net} & {+ AFA} & {+ MixUp} & {\makecell{+ MixUp + AFA}} & {\makecell{Best Model}} \\ \midrule
% ED             & 0.886          & 0.887       & \textbf{0.892} & 0.891         & 0.895 \\  
% ES             & 0.858          & 0.860       & 0.869          & \textbf{0.870} & 0.871 \\ 
% Total          & 0.872          & 0.874       & \textbf{0.880} & \textbf{0.880} & 0.883 \\ \bottomrule
% \end{tabular}%
% % }
% \label{tab:acdc_M\&Ms_results}
% \end{table}
% \paragraph{Prostate bi-parametric MRI}
% For the case of the prostate dataset, we consider the differences in performance of models trained on the Medical Segmentation Decathlon Prostate~\cite{antonelli_medical_2022} dataset when tested on the test set of P158 dataset~\cite{adams_prostate158_2022}.
% % The generalization gap / performance drop observed when transitioning from one dataset to another—highlights the challenges of domain adaptation for an automatic segmentation model. For example, a deep neural network trained on ACDC may perform well on similar data but struggle on M\&Ms due to differences in scanner vendors, contrast settings, or patient demographics. By analyzing this gap, we identify specific aspects of the data distribution that limit model's applicability. 
% % The results of the prostate segmentation task, as presented in Tab.~\ref{tab:msd_p158_results}, highlight the challenges of generalization due to the unique characteristics of the prostate gland and its anatomical variations. 
% % Unlike the cardiac datasets, where segmentation is relatively standardized, prostate segmentation is known to exhibit significant inter-rater variability due to the gland's large variations in shape, size, and texture, as noted in prior studies \cite{adams_prostate158_2022}. These variations make accurate and consistent annotations more difficult, resulting in a higher degree of noise and ambiguity in the ground truth labels, which can limit model performance.
% % This variability is reflected in
% The lower performance of models trained on the MSD dataset, with average DSC of 0.653 on the P158 dataset for the model trained only with base augmentation. However, the results also demonstrate that combining base augmentation with alone AFA or MixUp, can statistically significantly improve OOD-domain generalization. Models augmented with AFA achieve an average DSC of 0.706, representing a significant improvement over just the base augmentations model. Similarly, MixUp alone matches this performance, and the combination of MixUp + AFA also achieves similar results (0.695).
% % Despite these improvements, the best model trained on the P158 dataset achieves higher performance, with a Total Dice score of 0.833 and superior scores in both the PZ (0.784) and TZ (0.883). This gap reflects the limitations of training on a source dataset (MSD), it is a very small dataset of only healthy patients, while the test set of P158 has patients with PCa lesions. 
% %Nevertheless, the augmentations close this gap over the base options by almost 5\%, highlighting their effectiveness in enhancing segmentation performance under challenging conditions. These findings emphasize the need for augmentation strategies and dataset-specific tuning when addressing tasks with high inter-rater variability and substantial anatomical diversity.
% \begin{table}[!tb]
% \centering
% \caption{Performance comparison of models trained on Medical Segmentation Decathlon (Prostate) and P158 datasets. The evaluation is conducted on the P158 dataset, with results stratified into Peripheral Zone (PZ), Transition Zone (TZ), and Total Dice scores.}
% % \resizebox{\textwidth}{!}{%
% \small
% \begin{tabular}{@{}ccccc|c@{}}
% \toprule
% & \multicolumn{4}{c|}{{\makecell{Trained on MSD}}} & {\makecell{Trained on P158}} \\ \midrule
% & {nnU-Net} & {+ AFA} & {+ MixUp} & {\makecell{+ MixUp  + AFA}} & {\makecell{Best Model}} \\ \midrule
% PZ             & 0.555          & \textbf{0.605}       & 0.595 & 0.570         & 0.771 \\  
% TZ             & 0.751          & 0.808       & 0.816          & \textbf{0.819} & 0.881 \\ 
% Total          & 0.653          & \textbf{0.706}       & \textbf{0.706} & 0.695 & 0.826 \\ \bottomrule
% \end{tabular}%
% % }
% \label{tab:msd_p158_results}
% \end{table}