\section{SEG4SEG: Slice Discovery Methods for Segmentation}
\label{sec:seg4seg}


Here, we propose SEG4SEG (Systematic Error Grounding for SEGmentation), which extends SDMs to segmentation problems. The overall pipeline builds on DOMINO~\citep{eyuboglu2022domino}, which was developed for classification.


\subsection{Method Overview}





SEG4SEG consists of three steps (see Fig.~\ref{fig:seg4seg_framework}):


\noindent \textbf{Embedding the image space and performance metrics}\footnote{We treat both image features and performance metrics as embeddings, since both compress information into a lower-dimensional space, although the latter are not traditionally considered embeddings.}. Common approaches for image embedding include foundation models (e.g., CLIP) and latent representations from models pre-trained on related tasks.
In this work, we use CLIP for image embedding and include both the Dice score and the per-image positive predictive value (PPV) as performance embeddings.
To enable efficient clustering\footnote{High dimensionality makes most clustering methods substantially less efficient~\citep{umap_clustering_doc}.}, we apply UMAP~\citep{mcinnes2018umap} to reduce the dimensionality of the image embeddings.

\noindent \textbf{Clustering embedding information}. Previous work has used several clustering methods, including $k$-means, HDBSCAN, and Gaussian mixture models (GMMs). We use a GMM in this work because it allows flexible weighting between different variables. Specifically, we maximize
\begin{equation}
\label{eq:gmm}
\ell(\phi) = \sum_{i=1}^{n} \log \sum_{j=1}^{|S|}
P\left(S^{(j)} = 1\right)
P\left(z(x_i) \mid S^{(j)} = 1\right)
P\left( \text{perf} (\hat{y}_i,y_i) \mid S^{(j)} = 1\right)^\gamma,
\end{equation}
where $z$ is the embedding model; $x_i$, $y_i$, and $\hat{y}_i$ denote the image, its annotation, and the prediction, respectively; and $\gamma$ is a weighting factor balancing the effect of performance embeddings and image embeddings.
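To make the role of $\gamma$ explicit, one can take the logarithm of the $j$-th summand in Eq.~(\ref{eq:gmm}). Reading $S^{(j)} = 1$ as membership in the $j$-th slice (an interpretation of the notation assumed here), this gives
\[
\log P\left(S^{(j)} = 1\right)
+ \log P\left(z(x_i) \mid S^{(j)} = 1\right)
+ \gamma \log P\left( \text{perf} (\hat{y}_i,y_i) \mid S^{(j)} = 1\right).
\]
With $\gamma = 0$, the performance term vanishes and the objective reduces to that of a standard mixture model on the image embeddings alone, whereas larger values of $\gamma$ let the performance embeddings increasingly drive the slice assignments.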


\noindent \textbf{Inspecting the clustering results}. After clustering, we analyze the resulting clusters to identify potential issues in the dataset or model. This step typically requires additional annotations or analytical tools to characterize the discovered slices.



\subsection{Representation Design for Segmentation Tasks} \label{sec:representation_design}
Since segmentation operates at the pixel level, we extend both the image and metric representations to encode richer spatial information for more effective clustering.
Specifically, we extend the representations as follows:
% to the following as shown in Tab.~\ref{tab:variants}
\textbf{(a) Variants of image space embedding:}
In addition to embedding the original image $x_0$, as in the classification setting, we explore alternative inputs in which the image is masked with either the ground-truth or the predicted mask ($x_0 \cdot M_{\text{GT}}$ or $x_0 \cdot M_{\text{pred}}$),
where non-foreground pixels are zeroed out, to encode additional mask information.
\textbf{(b) Variants of performance space embedding:} Beyond the commonly used overlap metric (Dice score), we consider confusion-related metrics (positive and negative predictive values, PPV and NPV) that distinguish false-positive (FP) from false-negative (FN) behavior.
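
As a brief sketch, assuming the standard pixel-wise definitions of these metrics, let $\text{TP}$, $\text{FP}$, $\text{TN}$, and $\text{FN}$ denote the numbers of true-positive, false-positive, true-negative, and false-negative pixels in a single image. Then
\[
\text{Dice} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}, \qquad
\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad
\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}.
\]
Under these definitions, PPV is penalized directly by false positives and NPV by false negatives, whereas Dice penalizes both symmetrically. The handling of images with a zero denominator (e.g., no predicted foreground for PPV) is a separate convention that this sketch leaves open.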









\subsection{Evaluation Metrics for Slice Discovery}
Following \citet{bissoto2025subgroup}, we evaluate SDM performance through the trade-off between performance disparities and slice purity. The goal is to discover slices that are pure with respect to an attribute $A$ and whose performance deviates meaningfully from that of the overall population.

\paragraph{Performance disparity.} In contrast to prior work, we measure performance disparity using the omega-squared ($\omega^2$) measure \citep{cohen1973eta}, which quantifies the proportion of variance in performance explained by the partition $S$. Intuitively, $\omega^2$ measures how much better the partition separates high- and low-performance samples than a random grouping would.
We adopt $\omega^2$ instead of the difference between the best and worst subgroup performance used by \citet{bissoto2025subgroup}, because the latter is sensitive to small clusters and can therefore yield spurious results, whereas $\omega^2$ provides a more robust effect-size measure.
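
As an illustration, assuming the standard one-way ANOVA estimator of $\omega^2$ (the choice of estimator is an assumption of this sketch), let $p_i = \text{perf}(\hat{y}_i, y_i)$, let $\bar{p}$ be the mean of $p_i$ over all $n$ samples, and let $\bar{p}_s$ be the mean over the $n_s$ samples in slice $s$. With $k = |S|$ slices,
\[
SS_{\text{between}} = \sum_{s \in S} n_s \left(\bar{p}_s - \bar{p}\right)^2, \qquad
SS_{\text{total}} = \sum_{i=1}^{n} \left(p_i - \bar{p}\right)^2,
\]
\[
MS_{\text{within}} = \frac{SS_{\text{total}} - SS_{\text{between}}}{n - k}, \qquad
\omega^2 = \frac{SS_{\text{between}} - (k-1)\, MS_{\text{within}}}{SS_{\text{total}} + MS_{\text{within}}}.
\]
Since $(k-1)\, MS_{\text{within}}$ estimates the value of $SS_{\text{between}}$ expected under a random grouping, $\omega^2$ is close to zero for partitions unrelated to performance and approaches one when the slices explain most of the variance.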

\paragraph{Purity.} Slice purity measures the homogeneity of a partition $S$ with respect to attribute $A$:
\begin{equation}
AP(S) = \frac{1}{|A|} \sum_{a \in A}
        \max_{s \in \hat{S}}
        \left( \frac{n_{s,a}}{n_s} \right),
\end{equation}
where \(n_{s,a}\) denotes the number of samples with attribute value \(a\) within slice \(s\), and \(n_s\) is the total number of samples in slice \(s\).
Here, $\hat{S}$ denotes the subset of clusters in $S$ satisfying $n_s \ge N_{\min}$. Intuitively, we exclude very small clusters when computing purity to prevent their noise-dominated behavior from inflating the metric.





The purity metric $AP$ does \textit{not} penalize impure clusters as long as each attribute value $a \in A$ is concentrated in at least one slice.
Impure clusters are acceptable because representations may capture multiple data characteristics beyond the target attributes.
Taken together, the two metrics favor partitions that contain slices with high attribute purity and low performance.
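
As a worked illustration with hypothetical numbers, consider a binary attribute $A = \{a_1, a_2\}$ and three slices that all satisfy the size threshold: $s_1$ with $40$ samples ($36$ with $a_1$, $4$ with $a_2$), $s_2$ with $50$ samples ($10$ with $a_1$, $40$ with $a_2$), and $s_3$ with $10$ samples ($5$ with each value). Then
\[
AP(S) = \frac{1}{2} \left[
\max\left( \frac{36}{40}, \frac{10}{50}, \frac{5}{10} \right)
+ \max\left( \frac{4}{40}, \frac{40}{50}, \frac{5}{10} \right)
\right]
= \frac{0.9 + 0.8}{2} = 0.85.
\]
The fully mixed slice $s_3$ does not lower the score, because $a_1$ and $a_2$ are each concentrated in another slice.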


\subsection{Defining Success in Finding the Problematic Slice}
The two metrics introduced above assess, respectively, (i) whether an SDM can identify slices with low performance, and (ii) whether the SDM can isolate slices with high attribute purity. However, taken separately, these metrics do not directly answer the key question of interest: \emph{\textbf{Does the SDM successfully locate the problematic slice(s)?}}

To close this gap, we combine the two metrics and propose the following validation criterion to determine whether an SDM successfully identifies a problematic slice:



\begin{definition}[SDM Success Criterion]
\noindent
\normalfont
An SDM is considered to successfully identify a problematic slice if \emph{at least one} slice $s$ in the dataset satisfies all of the following conditions:
\begin{enumerate}
    \item \textbf{High Purity:}
    $P_s(A=a_f) > p_\theta$.
    The proportion of samples in slice $s$ that have the failure-related attribute value $a_f$ exceeds the purity threshold $p_\theta$.
    \item \textbf{Low Performance:}
    $q(\text{perf}(s)) \le q_\theta$.
    The quantile of the average performance of slice $s$ among all slices does not exceed the performance threshold $q_\theta$.
    \item \textbf{Sufficient Slice Size:}
    $n_s \ge N_{\min}$.  
    The slice $s$ contains at least $N_{\min}$ samples.
\end{enumerate}
\end{definition}


\noindent
If such a slice exists, the SDM can locate low-performance slices even in real-world settings where attribute labels are not always available, and the high purity of the slice may then help a human inspector form a hypothesis about the cause of the model failure.
We set $p_\theta=0.8$ so that clusters are sufficiently homogeneous (80\% purity), $q_\theta=0.4$ to capture the bottom 40\% of clusters by performance, and $N_{\min}$ to 2\% of the test set to enable detection of rare failure modes that may comprise only 5\% of samples.
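
As a worked illustration with hypothetical numbers, suppose the test set contains $1000$ images, so that $N_{\min} = 20$, and the SDM returns $10$ slices. Consider a slice $s$ with $n_s = 50$ samples, $45$ of which have attribute value $a_f$, and whose average performance is the third lowest among the $10$ slices. Assuming the quantile is computed as the rank divided by the number of slices (one possible convention),
\[
P_s(A = a_f) = \frac{45}{50} = 0.9 > 0.8, \qquad
q(\text{perf}(s)) = \frac{3}{10} = 0.3 \le 0.4, \qquad
n_s = 50 \ge 20,
\]
so all three conditions hold and the SDM counts as successful. If only $35$ of the $50$ samples had attribute value $a_f$ (purity $0.7$), the slice would fail the purity condition.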
