
\section{Experimental Design}

\begin{table}[t]
\centering

\caption{Overview of failure modes and corresponding datasets.}
\label{tab:cases}
% \scalebox{0.8}{
\resizebox{\textwidth}{!}{
\begin{tabular}{@{}c l l l l@{}}
\toprule
\textbf{Case} & \textbf{Failure Mode} & \textbf{Real-world Case} & \textbf{Dataset} & \textbf{Failure Attribute ($a_f$)} \\
\midrule
A & Shortcut (sample level) 
  & Calipers in ultrasound 
  & HC18 
  & Non-caliper \\

B & Shortcut (pixel level) 
  & Central cropping in skin lesions 
  & ISIC2018 
  & Non-central masking \\


C & Annotation style 
  & Boundary style in skin lesions 
  & ISIC2018 
  & Polygon-style annotation \\

D & Difficult cases 
  & Low-quality retinal images 
  & FIVES 
  & Low quality \\
\bottomrule
\end{tabular}
}
\end{table}

\begin{figure}[tb]
    \centering
    \includegraphics[width=0.85\linewidth]{figures/illu_allcases.pdf}
    \caption{
    Illustration of failure modes across four experimental cases. 
    Green backgrounds indicate samples that dominate the dataset, 
    while red backgrounds highlight samples with systematic errors that we aim to identify and slice out using SDM methods.
    For demonstration purposes, we show the same sample with and without manipulation to amplify differences; 
    in actual experiments, manipulated and unmanipulated versions of the same sample are never included in the same run. The added calipers in Case A are exaggerated for visual clarity.
    }
    \label{fig:illu_allcases}
\end{figure}

As shown in Tab.~\ref{tab:cases}, we select four representative cases from the failure mode taxonomy defined in Sec.~\ref{sec:taxonomy}, introduce the corresponding errors artificially (illustrated in Fig.~\ref{fig:illu_allcases}), and test whether the SDM can reveal them. We also evaluate the framework on two real-world datasets that naturally contain annotation style inconsistencies (ISIC) and image quality variations (FIVES).


\subsection{Experiment Set-up and Dataset Choices}\label{sec:case_abcd}




\paragraph{Case A -- Sample-level Shortcuts: Calipers in Ultrasounds.} 
We apply SEG4SEG to detect the caliper shortcut documented by \citet{lin2024shortcut}.
We use the HC18 dataset~\citep{van2018automated} and introduce different error levels by generating artificial calipers from the supplied segmentations.







\paragraph{Case B -- Pixel-level Shortcuts: Central Cropping in Skin Lesions.}
We extend the analysis of \citet{lin2024shortcut} on ISIC2018 by introducing synthetic cropping to reduce foreground-border correlations. 
For each cropping level $n$, we randomly select $n\%$ of the samples, apply random crops at half the original image size, and discard crops that contain no lesion pixels.
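As a sketch of the cropping rule, assuming that ``half the original image size'' refers to half of each side length (one possible reading), consider an image of size $H \times W$ with binary lesion mask $M$. A crop with top-left corner $(u, v)$ covers the window
\[
\Omega_{u,v} = \{\, (i, j) : u \le i < u + H/2,\; v \le j < v + W/2 \,\},
\]
and is kept only if $\sum_{(i,j) \in \Omega_{u,v}} M_{ij} > 0$. The symbols $H$, $W$, $M$, $u$, $v$, and $\Omega_{u,v}$ are illustrative notation introduced only for this sketch.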






\paragraph{Case C -- Annotation Styles: Skin Lesion.}
Three distinct annotation styles are observed in ISIC2018: \textit{flood-fill}, \textit{jagged}, and \textit{polygon}~\citep{zepf2023label}. We conduct two experiments:  (1) similar to Cases A and B, we introduce systematic errors by degrading \textit{flood-fill} annotations into \textit{polygon}-like ones at different levels, and (2) we use the original dataset with manually labeled annotation styles.


\paragraph{Case D -- Difficult Cases: Low Image Quality in Retinal Imaging.}
FIVES is a retinal dataset annotated with three types of image quality degradation. Following Case C's design, we evaluate the SDM under two setups: (1) synthetic degradation by darkening (intensity scaling with $\beta = 0.5$) and blurring (Gaussian with $\theta = 2.0$), and (2) the original dataset with existing quality annotations.
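As a sketch of the synthetic degradations, assuming that darkening multiplies every pixel intensity by $\beta$ and that $\theta$ is the standard deviation of the Gaussian kernel in pixels (both are assumptions of this sketch), the two operations on an image $I$ can be written as
\[
I_{\mathrm{dark}} = \beta\, I, \qquad
I_{\mathrm{blur}} = G_{\theta} * I, \qquad
G_{\theta}(x, y) = \frac{1}{2\pi\theta^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\theta^{2}}\right),
\]
where $*$ denotes 2D convolution. With $\beta = 0.5$, every pixel intensity is halved. The symbols $I$, $I_{\mathrm{dark}}$, $I_{\mathrm{blur}}$, and $G_{\theta}$ are illustrative notation, and the sketch makes no claim about whether the two operations are combined on the same image.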



\subsection{Implementation and Evaluation Details}


\paragraph{Evaluation pipeline.}
Our evaluation follows a two-stage pipeline that simulates realistic failure discovery scenarios:
(1) \textbf{Model training:} Train segmentation models on datasets with injected systematic errors 
(see Sec.~\ref{sec:case_abcd} and Appendix~\ref{sec:appendix_dataset} for error types and injection procedures).
(2) \textbf{SDM-based discovery:} Apply SDM to the \textit{test set} using the trained model predictions, 
original images, and ground-truth masks as inputs.
Crucially, failure attribute labels are \textit{not} provided to the SDM during slice discovery.
Detection performance is then evaluated by comparing discovered slices against the known failure attributes.






\paragraph{Segmentation model training.}
Segmentation models for all cases are implemented using \texttt{segmentation\_models\_pytorch}~\citep{Iakubovskii:2019}
with a U-Net architecture and a ResNet-34 encoder,
taking 3-channel RGB inputs.
Models are trained for 50 epochs using Adam (learning rate $1\times10^{-4}$, batch size 32) with binary cross-entropy loss.
Images are resized to $512\times512$ and augmented during training with random rotation, horizontal and vertical flips; all augmentations are disabled for validation and test splits.
We use either the original train/val/test partitions provided with each dataset 
or a 64/16/20 split when original splits are unavailable (see Appendix~\ref{sec:appendix_dataset}).
Model selection is based on validation set performance.

\paragraph{SDM configuration.}
We perform 5 random runs for each experiment to estimate the \textit{detection rate}, 
defined as the proportion of runs in which the problematic slice is successfully identified.
Failure attributes are injected at 5\%, 10\%, and 20\% prevalence. 
For representation extraction, we use CLIP to encode input images into 512-dimensional embeddings, 
followed by UMAP for dimensionality reduction to 8 dimensions.
The number of clusters is set to 10 for Cases A, C, and D, and 20 for Case B.
We sweep the weighting parameter $\gamma \in [10^{-3}, 10^{3}]$ (Eq.~\ref{eq:gmm}) 
and find that $\gamma = 10$ performs best across all settings; 
we report results using this value unless stated otherwise.
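The detection rate defined above can be written compactly. As a sketch, let $R$ be the number of runs and let $d_r \in \{0, 1\}$ indicate whether the problematic slice is identified in run $r$ (both symbols are notation introduced here for illustration):
\[
\mathrm{DR} = \frac{1}{R} \sum_{r=1}^{R} d_r .
\]
With $R = 5$, the detection rate takes values in $\{0, 0.2, 0.4, 0.6, 0.8, 1\}$; for example, identifying the slice in 4 of 5 runs gives $\mathrm{DR} = 0.8$.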
