
We plot the trends of the DSC and HD95 metrics per severity level for each transformation considered under image variations.
For ACDC, DSC trends are shown in Fig.~\ref{fig:all_trends_per_corruption_acdc} and HD95 trends in Fig.~\ref{fig:hd95_all_trends_per_corruption_acdc}.
For P158, DSC trends are shown in Fig.~\ref{fig:all_trends_per_corruption_p158} and HD95 trends in Fig.~\ref{fig:hd95_all_trends_per_corruption_p158}.

\input{sections/appendix/figure/dsc_corruption_trends_all_acdc}

\input{sections/appendix/figure/hd95_corruption_trends_all_acdc}

\input{sections/appendix/figure/dsc_corruption_trends_all_p158}

\input{sections/appendix/figure/hd95_corruption_trends_all_p158}

We see that base augmentations are highly effective in improving generalisation to some transformations, namely rotation, scale contrast compression and expansion, and iso-downsample. These are transformations that the base augmentations themselves overlap with. However, more complicated transformations are not completely overcome, for instance bias field, ghosting, Rician noise, k-space subsampling, spike noise, smoothing, aniso-downsampling and random motion.

The augmentations MixUp and AFA also appear to complement each other. Using AFA without MixUp can reduce performance on the HD95 metric for some transformations. This can be understood from the fact that regularising frequency components makes it harder to delineate boundaries, which typically correspond to high-frequency changes. MixUp, in contrast, does not deteriorate boundary delineation, but it is unable to regularise frequency components, which leads to a performance decline on various variations. Using both MixUp and AFA, however, generally leads to the best performance on each metric, and this pattern is consistent across both test datasets.
