\section{Experimental Setup and Results}


\subsection{Datasets}
We employ two MRI datasets to evaluate domain generalization in segmentation tasks. In a preprocessing step, we homogenize voxel spacings and pad or crop images to a uniform size within each dataset. For both datasets, we use the domain with the most subjects as the source domain and the remaining domains as target domains. \rev{The exact number of subjects and slices in each domain is listed in Table~\ref{tab:dataset_cases}.}

\subparagraph{Heart MRI.} The second version of the M\&Ms Challenge dataset \cite{mnmv1, mnmv2} consists of 8128 annotated cardiac MRI slices from 360 subjects, collected at sites in different countries with seven different scanning devices, which we use to define our domains. Each image is annotated with three segmentation classes (left ventricle, right ventricle, and myocardium), yielding a comparatively large, multi-class dataset that is well suited for deep learning applications. In our experiments, we refer to it as the M\&M dataset.

\subparagraph{Prostate MRI.} The prostate dataset \cite{pmri} comprises T2-weighted MRI scans of the prostate and their respective binary segmentation masks from six institutions, spanning three public datasets. Each institution uses distinct imaging devices, protocols, and field strengths, resulting in a rich collection of domain shifts with relatively few cases. In total, the dataset contains 1773 annotated slices across scans from 116 subjects. In our experiments, we refer to it as the PMRI dataset.

\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{figures/unet_eval.png}
    \caption{Segmentation quality in terms of volumetric and surface Dice on training and validation data from the source domain, as well as on different target domains. A massive domain shift is observed in PMRI, a more subtle one in M\&M.}
    \label{fig:unet-eval}
\end{figure}

\subsection{Image Segmentation with U-Nets}
To reduce the impact of confounding factors as much as possible, we use the same U-Net architecture and confidence predictor across datasets and tasks. We use MONAI \cite{monai} to implement a U-Net with 32 initial channels, a depth of four, and four residual units per level. Furthermore, we apply dropout with a rate of 0.1 in each ADN layer to ensure a fair comparison to a previously described indirect confidence estimation technique \cite{score-agreement}. We train the U-Net with a mixed loss of Dice and cross-entropy, using the Adam optimizer with a learning rate of $10^{-3}$ and otherwise default settings. We adopt the default data augmentations of nnU-Net \cite{nnunet} and train with learning rate scheduling and early stopping based on a held-out validation set. Although we follow these standard strategies to address potential domain shifts, we still observe a moderate drop in performance on M\&M and an essentially complete shift in the confidence score distribution for PMRI (see Figure~\ref{fig:unet-eval}). Together with the diverse dataset characteristics and our two confidence scores, we hope that this provides a convincing test bed for confidence prediction in medical image segmentation.

\subsection{Confidence Prediction}
The goal of our proposed adversarial perturbation scheme is to improve the confidence predictor's generalization capabilities. To quantify whether this goal has been met, we report the Pearson correlation, the excess area under the risk-coverage curve (eAURC) \cite{eAURC}, and the mean absolute error for volumetric and surface Dice scores, specifying mean and standard deviation across five runs.
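
For reference, the two less common measures can be sketched as follows. The notation is introduced here for illustration only and is an assumption rather than taken from the cited works: $c_i$ denotes the predicted confidence and $q_i$ the true quality score (volumetric or surface Dice) of test image $i$, with $i = 1, \dots, n$. The mean absolute error is then
\[
  \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} | c_i - q_i | ,
\]
and, in its usual formulation, the excess AURC subtracts the area obtained by an oracle that ranks all images by their true quality,
\[
  \mathrm{eAURC} = \mathrm{AURC} - \mathrm{AURC}^{\ast} ,
\]
so that a value of zero would correspond to a perfect ranking. The exact definition, including the choice of risk function, should be taken from \cite{eAURC}.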

In Figure~\ref{fig:results}, we compare our proposed framework for direct confidence prediction, with and without our adversarial perturbation scheme, to an approximate Bayesian method that we refer to as score agreement \cite{score-agreement}\rev{, and that has been identified as a robust baseline for failure detection in a recent comparative benchmark \cite{zenk2025}}. It is based on drawing Monte Carlo samples of segmentation masks with test-time dropout and measuring the agreement between them by averaging pairwise quality metrics, in our case, volumetric or surface Dice. Following the same setup as the authors, we take $N=15$ samples \rev{to saturate performance}, resulting in 105 pairwise comparisons. This procedure yields a confidence score that correlates with segmentation accuracy but does not estimate the corresponding quality metric directly. Therefore, the mean absolute error is not meaningful for score agreement, and we do not report it.
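
One way to write the agreement score described above is sketched below. The symbols are hypothetical and introduced only for this illustration: $\hat{y}_1, \dots, \hat{y}_N$ denote the masks sampled with test-time dropout for a single image, and $m$ denotes the chosen quality metric, which we assume to be evaluated once per unordered pair of samples:
\[
  a = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} m(\hat{y}_i, \hat{y}_j) .
\]
The number of unordered pairs is $N(N-1)/2$, which gives $15 \cdot 14 / 2 = 105$ comparisons for $N=15$. Since the number of metric evaluations grows quadratically in $N$, the cost of the metric itself quickly becomes the dominant factor.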

On most domains, for both datasets and confidence measures, our adversarial perturbation scheme significantly improves the predictor's ability to generalize, achieving higher correlation, lower eAURC, and lower mean absolute error. Our approach also narrows the gap to the computationally much more expensive score agreement methodology, even outperforming it in several cases. \rev{Additionally, we run score agreement with only $N=2$ samples to make its computational complexity more comparable to that of direct confidence prediction. In this scenario, direct prediction is mostly superior.}

\rev{We also explored aggregated predictive entropy, averaged over $N=15$ dropout samples, as another baseline, but found its results too weak to include in Figure~\ref{fig:results}. On average over datasets and confidence metrics, it achieved a Pearson correlation below 0.3.}

In addition to the comparisons in Figure~\ref{fig:results}, we investigate the benefits of fine-tuning the segmentation network's backbone alongside the confidence predictor, but find only marginal improvements at the cost of increased computational and implementation complexity (see Table~\ref{tab:ablation}).

\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{figures/results_all.png}
    \caption{Evaluation of our proposed direct confidence prediction. Adversarial perturbations increase correlation and decrease both the excess area under the risk-coverage curve and the mean absolute error in almost all cases. In several cases, our approach even provides better correlation and eAURC than the computationally much more expensive score agreement approach, which does not provide absolute predictions.}
    \label{fig:results}
\end{figure}

\subsection{Computation Times}

Table~\ref{tab:times} compares the running times of our proposed approach (top row) to those of computing score agreement. Due to its shallow architecture, our confidence predictor adds negligible overhead to the segmentation network itself. Fine-tuning (second row) roughly doubles our running time, with marginal benefits reported in Table~\ref{tab:ablation}.

In any case, our running times are much shorter than those of the 15 forward passes required to compute saturated score agreement \rev{and still considerably shorter than those of score agreement with two forward passes}. Most importantly, inference times in our approach are independent of the computational complexity of the selected quality metric, which by far dominates the overall running time of score agreement.
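
As a rough reading of Table~\ref{tab:times}, dividing the M\&M timings of score agreement by that of our predictor without fine-tuning (0.0009 seconds) gives the following approximate speed-up factors for volumetric Dice with $N=2$ and $N=15$, and for surface Dice with $N=2$ and $N=15$, respectively:
\[
  \frac{0.0059}{0.0009} \approx 6.6, \qquad
  \frac{0.1011}{0.0009} \approx 112, \qquad
  \frac{0.0230}{0.0009} \approx 26, \qquad
  \frac{1.8964}{0.0009} \approx 2107 .
\]
The corresponding factors on PMRI, relative to 0.0020 seconds, are approximately 3.1, 40, 7.7, and 534. These ratios are only indicative, since they are computed from tabulated times that are rounded to four decimal places.
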
\begin{table}[t]
    \centering
    \begin{tabular*}{0.78\linewidth}{@{\extracolsep{\fill}}lrr}
        \toprule
        Method & M\&M & PMRI \\
        \midrule
        $f_\theta$ + $C_\phi$ \hfill         & 0.0009 & 0.0020 \\
        \; + $f_\theta$ after fine-tuning    & 0.0018 & 0.0038 \\
        \rev{Volumetric Dice Agreement ($N=2$)} & 0.0059 & 0.0062 \\
        Volumetric Dice Agreement ($N=15$) & 0.1011 & 0.0802 \\
        \rev{Surface Dice Agreement ($N=2$)} & 0.0230 & 0.0153 \\
        Surface Dice Agreement ($N=15$) & 1.8964 & 1.0683 \\
        \bottomrule
    \end{tabular*}
    \caption{Inference times in seconds per image, averaged across 100 runs on a single NVIDIA A40 GPU. We use MONAI's metric implementations and calculate agreements in a single batch. Our proposed approach (top row), also with fine-tuning (second row), is much faster than score agreement (rows three to six), especially with expensive quality metrics such as surface Dice.}
    \label{tab:times}
\end{table}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../submission"
%%% End:
