\section{Introduction}


In high-stakes domains such as medical imaging, understanding why a machine learning model fails is key to identifying, and ultimately addressing, the root causes of its failures.
As a motivating example, consider the case of pneumothorax classification from chest X-rays, for which \citet{larrazabal2020gender} documented a significant gender performance gap.
However, an explanation of this phenomenon, and thus a potential solution, remained elusive.
Initial hypotheses centered on underrepresentation and biological factors (e.g., breast shadows), yet both were systematically ruled out~\citep{weng2023sex}.
Finally, \citet{olesen2024slicing} applied Slice Discovery Methods (SDMs) to investigate this problem, revealing that the model had learned a shortcut based on the presence of chest drains in the images.
Chest drain prevalence differs systematically between genders, thereby linking the shortcut to the previously unexplained gender performance disparity and, finally, providing a root cause explanation.\footnote{The pneumothorax--chest drain shortcut had been documented earlier, but the connection to gender performance gaps was unknown~\citep{oakden2020hidden,jimenez2023detecting}.}

SDMs are unsupervised or semi-supervised clustering methods that aim to identify semantically coherent clusters (slices) of input data that differ in model performance, thereby aiding the discovery of model failure modes~\citep{eyuboglu2022domino,bissoto2025subgroup,olesen2024slicing}.
SDMs have been predominantly applied to image classification, and a successful application to segmentation tasks has not yet been demonstrated.
However, segmentation models, like classification models, have been shown to suffer from systematic subgroup performance differences.
\citet{PuyolAnton2022} demonstrated racial bias in DL-based cine CMR segmentation, \citet{Li2024} showed demographic performance gaps in SAM-based abdominal organ CT segmentation, \citet{Ioannou2022} found significant sex and racial performance gaps in brain MR segmentation, and many further studies have presented similar findings~\citep{Li2024a,Cevora2024,Dou2024}.
Yet most such studies do not investigate the root causes of these observed disparities -- which, as our motivating example illustrates, is crucial for successfully addressing these failure modes and making models more robust.
Motivated by this gap, we extend the slice discovery paradigm to segmentation tasks.



Significant gaps remain in applying SDMs to segmentation tasks for medical imaging:
\begin{description}
    \item[Issue 1: Lack of focus on segmentation tasks.] Most existing SDM frameworks target classification with single-label predictions. Segmentation operates on pixel-level annotations, fundamentally expanding the failure space: models can fail through boundary errors, spatial shortcuts, or texture biases that are absent in classification. While recent work shows shortcut learning occurs at both sample and pixel levels~\citep{lin2024shortcut}, a systematic investigation of SDMs for segmentation failures remains absent.
    \item[Issue 2: Absence of success criteria for slice discovery results.] To use slice discovery in practice, a way to evaluate whether the discovered slices genuinely capture systematic errors is needed. However, existing SDMs lack clear evaluation criteria, making it challenging to determine whether an SDM actually works.
    \item[Issue 3: Lack of taxonomy for segmentation failures.] Unlike classification, segmentation lacks a systematic taxonomy of failure types. This gap hinders both the development of targeted SDM approaches and the interpretation of discovered slices.
\end{description}


Accordingly, the \textbf{core focus and contributions} of our work are:
\begin{enumerate}
    \item We adapt SDMs to segmentation for the first time through our proposed SEG4SEG pipeline (Fig.~\ref{fig:seg4seg_framework}) and analyze embedding variants tailored to segmentation tasks.
    \item We redefine evaluation metrics and propose principled criteria for assessing slice discovery quality.
    \item We propose a systematic taxonomy of failure modes in medical image segmentation and benchmark our SDM across four controlled-error settings and two real-world datasets, demonstrating SEG4SEG's effectiveness in uncovering diverse failure modes.
\end{enumerate}


\begin{figure}[tb]
    \centering
    \includegraphics[width=1.0\linewidth]{figures/pipeline_v2.pdf}
    \caption{\textbf{Illustration of the SEG4SEG framework.} The pipeline comprises three steps: (1) embedding the image/annotation information; (2) clustering using both the image embedding and the performance score; (3) inspecting the problematic slices.}
    \label{fig:seg4seg_framework}
\end{figure}

