\section{A Taxonomy of Segmentation Failure Modes}
\label{sec:taxonomy}
We first categorize documented segmentation failure modes into four groups: shortcuts, label noise, underrepresentation, and intrinsically difficult cases.
\subsection{Shortcuts/Unwanted Correlation}
Shortcut learning occurs when models exploit unwanted correlations between input features and targets instead of learning meaningful patterns~\citep{geirhos2020shortcut, lin2024shortcut}. 
In segmentation, shortcuts can occur at both the sample level and pixel level:

\noindent \textbf{Sample-level shortcuts} arise from unwanted correlations between \textit{sample-level features} and \textit{segmentation performance}. One example is the presence of calipers in fetal ultrasound scans: models achieve better segmentation performance when calipers are present~\citep{lin2024shortcut}. Including calipers in scans is common in clinical practice~\citep{pu2022mobileunet, bano2021autofb, sun2022issmf, yang2020automatic}, yet their presence can mislead models into shortcut learning. We validate this failure mode in Sec.~\ref{sec:case_abcd} (Case A).
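
\noindent One possible way to make this concrete is sketched here in illustrative notation that is not taken from the cited works. Assume a binary indicator $a_i \in \{0,1\}$ marking whether sample $i$ carries the attribute (e.g., calipers) and a per-sample segmentation score $m_i$ such as the Dice coefficient. With $S_1 = \{i : a_i = 1\}$ and $S_0 = \{i : a_i = 0\}$, the gap in mean performance between the two groups is
\[
\Delta \;=\; \frac{1}{|S_1|} \sum_{i \in S_1} m_i \;-\; \frac{1}{|S_0|} \sum_{i \in S_0} m_i .
\]
Under these assumptions, a sample-level shortcut is suggested when $\Delta$ is clearly nonzero and cannot be explained by differences in image content between $S_1$ and $S_0$.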


\noindent \textbf{Pixel-level shortcuts} arise from unwanted correlations between \textit{pixel locations} and \textit{labels}. A representative example is spatial bias in skin lesion segmentation: \citet{lin2024shortcut} demonstrated that when all training samples are center-cropped, models learn to systematically predict background labels for boundary pixels, regardless of content. Similar border artifacts appear in published models~\citep{wang2023medical, dai2022ms}, where U-Net and CA-Net models failed to recognize lesions near image boundaries, suggesting widespread reliance on spatial priors rather than visual features. We validate this failure mode in Sec.~\ref{sec:case_abcd} (Case B).
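
\noindent As an illustrative sketch in hypothetical notation that is not drawn from the cited works, let $y_i(p) \in \{0,1\}$ denote the label of pixel location $p$ in training sample $i$, for $i = 1, \dots, N$. The empirical foreground prior at location $p$ can then be written as
\[
\hat{\pi}(p) \;=\; \frac{1}{N} \sum_{i=1}^{N} y_i(p) .
\]
If $\hat{\pi}(p) \approx 0$ for all locations near the image border, as would be expected when every training sample is center-cropped, a model can reach low training loss at those locations by predicting background irrespective of image content, which corresponds to the spatial shortcut described above.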

\subsection{Label Noise}
\noindent \textbf{Annotation styles} are often neglected in segmentation research, yet studies show that they can substantially affect model performance~\citep{nichyporuk2022rethinking, abhishek2024segmentation, zhang2020disentangling, zepf2023label}. Variations in annotation style reflect the inherent subjectivity of labeling, driven by annotation protocols, rater expertise, and data pre-processing. We validate this failure mode in Sec.~\ref{sec:case_abcd} (Case C).

\noindent \textbf{Annotation errors}, inevitable in human-driven labeling, can introduce systematic segmentation failures. Common manifestations are omission errors, inclusion errors, and annotator-driven cognitive bias arising from boundary-related ambiguities~\citep{vuadineanu2022analysis}.

\subsection{Underrepresentation}
Class imbalance is a longstanding challenge in segmentation and a well-documented failure mode. Numerous remedies have been proposed~\citep{li2020analyzing, muller2022towards}, particularly for structurally small anatomical regions such as retinal vessels~\citep{fauzi2022effect}.
Underrepresentation also manifests at the sample level, where limited coverage of certain demographic groups~\citep{puyol2021fairness} or disease subtypes leads to models that underperform on these minority subgroups.


\subsection{Difficult Cases}
Beyond the factors outlined above, certain subgroups may remain intrinsically difficult to segment due to image-dependent characteristics. Typical sources of increased complexity include degraded image quality~\citep{jin2022fives}, challenging anatomical topology such as curvilinear structures~\citep{lin2023dtu}, and substantial variability in radiographic appearance~\citep{heller2021state}. We validate this failure mode in Sec.~\ref{sec:case_abcd} (Case D).



