
\section{Problem Definition and Background}

In this section, we formalize the problem of detecting systematic performance disparities in segmentation models and review existing SDMs developed for classification tasks, which our work builds upon.




\subsection{Problem Set-up: Performance Disparities in Hidden Subgroups} \label{sec:problem}
We now formalize the learning setting and define how subgroup-related performance disparities manifest in segmentation models.
Given a dataset $D=\{(x_i,y_i)\}_{i=1}^N$ and a hidden binary attribute $A$ (e.g., the presence of calipers), we assume that the marginal distribution of $A$ is consistent across the training, validation, and test splits\footnote{This assumption follows prior work on SDMs and renders the SDM task more challenging than shortcut learning detection, where the distribution often differs in the test set.}, i.e., $P(A=1 \mid D_{train}) = P(A=1 \mid D_{val}) = P(A=1 \mid D_{test}).$
We train a segmentation model $f$ using $D_{train}$ and perform model selection on $D_{val}$.
We are interested in performance disparities with respect to $A$, i.e.,
$\mathbb{E}[\text{perf}(x) \mid A = 1] \neq \mathbb{E}[\text{perf}(x) \mid A = 0].$
Our goal is to develop a method that detects such disparities from model outputs on unseen data, without requiring access to $A$ at inference time. In our experiments, we use and manipulate a known $A$ to confirm that the method can surface it.
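
As an illustrative way to quantify this disparity, one could consider the gap between the two conditional expectations, together with its empirical counterpart on an evaluation set for which $A$ is known. The symbols $\Delta_A$, $\hat{\Delta}_A$, and $I_a$ are introduced here for exposition only and are not part of the formal setup; $A_i$ denotes the attribute value of sample $x_i$:
\[
\Delta_A = \mathbb{E}[\text{perf}(x) \mid A = 1] - \mathbb{E}[\text{perf}(x) \mid A = 0],
\]
\[
\hat{\Delta}_A = \frac{1}{|I_1|}\sum_{i \in I_1} \text{perf}(x_i) - \frac{1}{|I_0|}\sum_{i \in I_0} \text{perf}(x_i),
\]
where $I_a = \{i : A_i = a\}$ is assumed to be non-empty for $a \in \{0,1\}$. Under this sketch, a disparity corresponds to $\Delta_A \neq 0$, and $\hat{\Delta}_A$ can be computed only when $A$ is observed, as in controlled experiments.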


\subsection{Slice Discovery Preliminaries: Concept and Motivations} 

SDMs aim to group samples into clusters (slices) of semantically similar samples, enabling the identification of slices on which the model performs systematically worse.
In practice, SDMs operate on feature representations of test samples, such as model predictions or foundation model embeddings. Let $z(x)$ denote the representation extracted from input $x$, which serves as the basis for clustering. The choice of $z(x)$ determines which types of failure modes an SDM can potentially uncover. For example, classification logits may reveal class-specific confusion, while encoder embeddings of the image capture image-level features.



Formally, let $D_{test} = \{(x_i, y_i, A_i)\}_{i=1}^N$ denote the test set, where $A_i \in \{0,1\}$ is a latent binary attribute associated with a failure mode (e.g., annotation style, image quality).
Given representations $\{z(x_i)\}_{i=1}^N$, an SDM produces a partition $S = \{s_1, s_2, \ldots, s_K\}$, where each slice $s_k$ consists of samples that are close in the representation space and exhibit similar performance. 
In our case, we aim to identify slices $s \in S$ that are predominantly composed of samples with $A=1$ and whose average performance is lower than that of slices dominated by $A=0$.
Importantly, this clustering process is conducted in an unsupervised manner, without access to the underlying attribute $\{A_i\}$.
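
To make this target concrete, a slice can be summarized by two quantities, sketched here under the assumption that each slice $s_k$ is treated as a set of sample indices. The symbols $\pi_k$ and $\mu_k$ are illustrative notation and are not taken from an existing SDM:
\[
\pi_k = \frac{1}{|s_k|}\sum_{i \in s_k} A_i,
\qquad
\mu_k = \frac{1}{|s_k|}\sum_{i \in s_k} \text{perf}(x_i).
\]
Here $\pi_k$ is the prevalence of $A=1$ within slice $s_k$ and $\mu_k$ is its mean performance; a slice of interest would have high $\pi_k$ and low $\mu_k$ relative to the other slices. Since $A_i$ is hidden from the SDM, $\pi_k$ would be available only for evaluation and not during clustering.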




\subsection{Existing SDMs for Classification: Capabilities and Limitations}



\paragraph{Pipeline Design.} SDMs for classification have evolved from simple clustering to sophisticated cross-modal frameworks.\footnote{Some approaches mitigate subgroup disparities without uncovering their causes \citep{jain2022distilling, kim2019multiaccuracy, sohoni2020no}. These fall outside our scope, as we focus on failure mode discovery.} \citet{oakden2020hidden} pioneered the approach using $k$-means on pre-softmax features. DOMINO~\citep{eyuboglu2022domino} popularized a workflow that combines foundation model embeddings (CLIP), clustering (GMM), and natural language interpretation. Subsequent work extended this framework: FACTS~\citep{yenamandra2023facts} amplified correlations prior to clustering, PlaneSpot~\citep{plumb2022towards} improved the dimensionality reduction step, and ViG-Bias~\citep{marani2024vig} integrated visual explanations.
An alternative paradigm, Spotlight~\citep{d2022spotlight}, identifies contiguous low-performance regions without discrete clustering.
However, all existing methods focus on classification, leaving segmentation unexplored.



\paragraph{Evaluation.} 
Evaluating slice discovery remains challenging. Early work evaluated results on a case-by-case basis, using subgroup prevalence within slices \citep{oakden2020hidden, olesen2024slicing}. DOMINO introduced precision-based metrics that were widely adopted in subsequent work, including FACTS. However, these metrics are limited to synthetic datasets with known bias attributes, which motivated \citet{bissoto2025subgroup} to propose improved metrics for real-world medical imaging settings.
Building on this work, we further refine these metrics and introduce criteria to assess whether problematic slices are discovered.



