\section{SDM Results and Discussion}
\begin{figure}[tb]
    \centering
    \includegraphics[width=1.0\linewidth]{figures/result_sdm_reb.pdf}
    \caption{\textbf{SDM results.} (a) Proposed criteria evaluated over 5 random runs with varying failure-attribute proportions. (b) Case A. Top: the performance gap between caliper and non-caliper samples is evident at non-caliper prevalences of 5\% and 10\% but negligible at 80\%. Bottom: the performance-pass and purity-pass rates show no sensitivity to error levels, whereas the proposed criteria do. (c) Case C results with different embeddings. Statistical significance is assessed using independent two-sample t-tests.}
    \label{fig:result_sdm}
\end{figure}


\paragraph{SEG4SEG is able to discover problematic slices in segmentation.}
Fig.~\ref{fig:result_sdm}a shows that our SDM effectively identifies performance-relevant problematic slices. In Cases A, C, and D, our SDM achieves high detection rates even at a prevalence as low as 5\%. Case B with central cropping exhibits lower detection at low prevalence, likely because SEG4SEG operates at the sample level, whereas pixel–position shortcuts may require patch- or pixel-level grouping for precise identification.


\paragraph{Purity and performance disparity alone do not capture slicing success.}
Fig.~\ref{fig:result_sdm}b illustrates that neither purity nor performance disparity, whether considered independently or jointly, suffices to determine whether a slice is meaningfully identified. We define a performance pass as $\omega^2 > 0.14$\footnote{An $\omega^2$ value of 0.14 indicates a large effect~\citep{field2024discovering}.} and a purity pass as $\text{Purity}(a_f) > 0.8$.
However, even when the caliper/non-caliper difference is negligible, both the performance-pass and purity-pass rates remain high, meaning that
the low-performing slice is not necessarily the one driving the purity signal. This highlights a mismatch between these metrics and the actual reliability of slice discovery.
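For concreteness, the effect size behind the performance-pass threshold can be written out under the assumption that $\omega^2$ is obtained from a one-way ANOVA of the per-sample performance metric across $k$ groups; the grouping and the symbols below are illustrative notation and should be read as a sketch rather than the exact computation used in the experiments:
\[
\omega^2 = \frac{SS_{\mathrm{between}} - (k-1)\,MS_{\mathrm{within}}}{SS_{\mathrm{total}} + MS_{\mathrm{within}}},
\]
where $SS_{\mathrm{between}}$ and $SS_{\mathrm{total}}$ are the between-group and total sums of squares and $MS_{\mathrm{within}}$ is the within-group mean square.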


\paragraph{The proposed evaluation criteria reflect whether a failure attribute affects model performance.}
Fig.~\ref{fig:result_sdm}b shows that the proposed criteria increase their pass rates as the performance gap between caliper and non-caliper samples grows, indicating that the criteria align with the actual impact of the failure attribute.





\paragraph{Segmentation-specific embedding choices are critical for failure detection.}
Case~C illustrates the necessity of incorporating embedding variants (Sec.~\ref{sec:representation_design}), because the error stems from annotation quality, not image appearance. As shown in Fig.~\ref{fig:result_sdm}c, using default embeddings (original images + Dice) fails to detect the degraded-annotation samples. Detection improves when the image embedding incorporates annotation masks, and improves further when Dice is replaced with the positive predictive value (PPV), which is more sensitive to pixel-level false positives.
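For reference, both metrics can be written in terms of pixel-level true positives (TP), false positives (FP), and false negatives (FN); these are the standard definitions and are not specific to our setup:
\[
\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.
\]
As a purely illustrative example with made-up counts, $\mathrm{TP}=80$, $\mathrm{FN}=0$, and $\mathrm{FP}=20$ give $\mathrm{Dice}=160/180\approx 0.89$ but $\mathrm{PPV}=80/100=0.80$, so the same false positives lower PPV more than Dice.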





\begin{figure}[tb]
    \centering
    \includegraphics[width=1.0\linewidth]{figures/wild_data.pdf}
    \caption{\textbf{SEG4SEG uncovers failure modes in real-world datasets}: (a) annotation style in skin lesion segmentation, (b) low image quality in retinal imaging. Performance disparities are shown on the left; SDM results for different $\gamma$ are shown on the right. Statistical significance is assessed using independent two-sample t-tests.}
    \label{fig:real_world_cases}
\end{figure}


\begin{figure}[tb]
    \centering
    \includegraphics[width=0.76\linewidth]{figures/fives_samples.pdf}
    \caption{\textbf{Exemplary SDM result on the original FIVES dataset.} 
Center: Dice scores for the 10 clusters and the proportion of low-quality images in each cluster. 
Left and right: samples from selected clusters. The SDM identifies low-performance slices that contain a high proportion of low-quality images. }

    \label{fig:fives_samples}
\end{figure}


\paragraph{SEG4SEG also works on real-world datasets.}
Applied to the original ISIC2018 (Case~B) and FIVES (Case~D) datasets, our SDM identifies problematic samples arising from annotation-style inconsistencies and low image quality, respectively. As shown in Fig.~\ref{fig:real_world_cases}, the SDM passes the proposed criteria for these failure modes, and the significant differences between the subgroups confirm that the failures exist.
Fig.~\ref{fig:fives_samples} provides a detailed example of an SDM result for the low-quality retinal images. As illustrated in the center plot of Fig.~\ref{fig:fives_samples}, clusters with poorer performance exhibit a higher proportion of low-quality samples. Visualizing representative samples from each cluster further supports this trend: clusters with higher performance predominantly contain high-quality images, whereas cluster~0, which performs the worst, consists entirely of low-quality images. The SDM is also capable of capturing other meta-attributes: for example, one cluster groups together images with similar visual patterns, 93\% of which correspond to cases without diagnosed eye disease.





\section{Conclusion}
In this work, we investigated whether SDMs can be applied to discovering failure modes in segmentation tasks.
We have proposed a taxonomy of segmentation failure modes, selected representative test cases, and validated an adapted SDM (SEG4SEG) across four controlled-error settings and two real-world settings.
SEG4SEG operates by clustering test samples based on their representations in an unsupervised manner, identifying slices where performance systematically degrades.
Our results demonstrate that SEG4SEG is effective in revealing diverse failure modes in medical image segmentation and shows strong potential as a tool for systematic failure analysis.


\section*{Acknowledgment}
NW, AF and SB were partially funded by DTU Compute, the Technical University of Denmark; the Pioneer Centre for AI (DNRF grant nr P1); and the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (MLLS, grant NNF20OC0062606). 
This work was conducted during NW’s external research stay, which was partially supported by the Otto Mønsted foundation, IDAs og Berg-Nielsens Studie-og støttefond, and a travel scholarship from DTU.
LMK and AB were supported by the Diabetes Center Berne. 
SS was supported by Deutsche Forschungsgemeinschaft (DFG) – EXC number 2064/1 – Project number 390727645, and by the Carl Zeiss Foundation in the project ``Certification and Foundations of Safe Machine Learning Systems in Healthcare''.
The funding agencies had no influence on the writing of the manuscript nor on the decision to submit it for publication. 
