\newpage

\appendix

\section{Ablation Study}
\subsection{Representation variants}

Tab.~\ref{tab:repr_design_seg} summarizes the ablation study on the representation variants
discussed in Sec.~\ref{sec:representation_design} for all four cases.
For three out of four cases, the default setting, which uses the original image \(x_0\)
as the image-space input and the Dice score as the performance-space representation, achieves
the best detection performance.
In contrast, for Case~C, where the annotation style plays a critical role,
masking \(x_0\) with the ground-truth annotation (\(x_0 \cdot M_{\mathrm{GT}}\)) improves the detection rate,
particularly when combined with the PPV metric.
Notably, masking \(x_0\) with the predicted annotation (\(x_0 \cdot M_{\mathrm{pred}}\)) does not consistently improve
performance across the four cases.
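
For reference, the two performance-space variants can be written in terms of per-sample pixel counts, assuming the standard definitions. The symbols \(\mathrm{TP}\), \(\mathrm{FP}\), and \(\mathrm{FN}\) (true-positive, false-positive, and false-negative pixels) are notation introduced here for illustration:
\[
\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},
\qquad
\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.
\]
Dice penalizes both false positives and false negatives, whereas PPV depends only on false positives.
Likewise, we read the masked inputs as element-wise products with a binary mask \(M \in \{0,1\}^{H \times W}\) (an assumption about the notation; \(H\), \(W\), and the pixel indices \(i,j\) are illustrative), i.e., \((x_0 \cdot M)_{ij} = (x_0)_{ij}\, M_{ij}\), so that pixels outside the mask are set to zero.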


\begin{table}[tb]
\centering
\caption{\textbf{Ablation study on representation variants.}
We evaluate variants of the image-space input, including the original image \(x_0\),
the ground-truth-masked image \(x_0 \cdot M_{\mathrm{GT}}\), and the prediction-masked image
\(x_0 \cdot M_{\mathrm{pred}}\), as well as variants of the performance-space representation,
covering an overlap-based metric (Dice) and a confusion-related metric (PPV).
Entries report the fraction of five random-seed runs in which the failure mode is successfully detected,
for each failure-attribute ratio. The variant pair used in Fig.~\ref{fig:result_sdm}a is underlined.}
\label{tab:repr_design_seg}
\setlength{\tabcolsep}{6pt}
\renewcommand{\arraystretch}{1.15}

\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}}

% \scalebox{0.95}{
\resizebox{0.8\textwidth}{!}{
\begin{tabular}{ll l C{2cm} C{2cm} C{2cm}}
\toprule
\textbf{Image Space}
& \textbf{Perf. Space}
& 
& \multicolumn{3}{c}{\textbf{Detection Rate (out of 5 seeds)}} \\
\cmidrule(lr){4-6}
 \textbf{Variant} &  \textbf{Variant} & & \multicolumn{3}{c}{Failure attribute ratio =} \\
 && &{ 5\%} & {10\%} & {20\%} \\
\midrule

% ===================== Case A =====================
\multicolumn{6}{c}{\small\textit{Case A – Sample-level Shortcuts: Calipers in Ultrasounds}}\\[-2pt]

\(x_0\)                 & \multirow{3}{*}{Dice} &  & \uline{1.0} & \uline{0.8} & \uline{1.0} \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 1.0 & 1.0 & 1.0 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 1.0 & 1.0 & 0.8 \\
\addlinespace[3pt]
\(x_0\)                 & \multirow{3}{*}{PPV}  &  & 1.0 & 0.8 & 0.4 \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 1.0 & 1.0 & 1.0 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 1.0 & 1.0 & 0.8 \\


\midrule
% ===================== Case B =====================
\multicolumn{6}{c}{\small\textit{Case B – Pixel-level Shortcuts: Central Cropping in Skin Lesions}}\\[-2pt]

\(x_0\)                 & \multirow{3}{*}{Dice} &  & \uline{0.0} & \uline{0.0} & \uline{1.0} \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 0.0 & 0.0 & 0.8 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 0.0 & 0.0 & 0.2 \\
\addlinespace[3pt]

\(x_0\)                 & \multirow{3}{*}{PPV}  &  & 0.4 & 0.2 & 1.0 \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 0.6 & 0.2 & 0.4 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 0.0 & 0.0 & 0.0 \\
\midrule
% ===================== Case C =====================
\multicolumn{6}{c}{\small\textit{Case C – Annotation Styles: Skin Lesion}}\\[-2pt]

\(x_0\)                 & \multirow{3}{*}{Dice} &  & 0.0 & 0.0 & 0.0 \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 0.0 & 0.6 & 0.4 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 0.0 & 0.0 & 0.0 \\
\addlinespace[3pt]

\(x_0\)                 & \multirow{3}{*}{PPV}  &  & 0.0 & 0.0 & 0.0 \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & \uline{0.0} & \uline{1.0} & \uline{0.2} \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 0.0 & 0.0 & 0.0 \\

\midrule
% ===================== Case D =====================
\multicolumn{6}{c}{\small\textit{Case D – Difficult Cases: Low Image Quality in Retinal Imaging}}\\[-2pt]

\(x_0\)                 & \multirow{3}{*}{Dice} &  & \uline{1.0} & \uline{1.0} & \uline{1.0} \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 1.0 & 1.0 & 1.0 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 1.0 & 1.0 & 1.0 \\
\addlinespace[3pt]

\(x_0\)                 & \multirow{3}{*}{PPV}  &  & 1.0 & 1.0 & 1.0 \\
\(x_0 \cdot M_{\mathrm{GT}}\)   &                  &  & 1.0 & 1.0 & 1.0 \\
\(x_0 \cdot M_{\mathrm{pred}}\) &                  &  & 1.0 & 1.0 & 1.0 \\

\bottomrule


\end{tabular}
}
\end{table}


\section{Dataset Details} \label{sec:appendix_dataset}

We summarize the datasets used in our experiments in Tab.~\ref{tab:dataset_details}. Additional details are as follows:

\begin{itemize}

    \item For ISIC2018, which is used in Cases~B and~C, the original dataset contains 2594/100/1000 samples for training/validation/testing. In the manipulated Case~B experiment, the dataset is synthesized by cropping, and we discard cropped samples in which no lesion remains. For the manipulated failure mode of Case~C, we select only the samples with flood-fill annotations and degrade them to polygon style. The resulting dataset is therefore smaller, because polygon-style annotations cannot be accurately converted back to flood-fill style.
    
    \item The annotation style labels in Case C are manually annotated by the authors and will be released with the code.

    \item For Case~D with FIVES, the original dataset provides annotations for three low-quality issues: illumination and color distortion, blur, and low contrast. For simplicity, we merge them into a unified low-quality label (i.e., a sample is labeled as low quality if it exhibits any of these issues). For the manipulated Case~D experiment, we collect all normal-quality samples and degrade them, so the manipulated variant contains fewer samples than the real-world variant.
    
\end{itemize}
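
The unified low-quality label described above amounts to a logical OR over the three per-issue annotations. As a sketch, with the hypothetical binary indicators \(q_{\mathrm{illum}}, q_{\mathrm{blur}}, q_{\mathrm{contrast}} \in \{0,1\}\) (names introduced here for illustration only):
\[
q_{\mathrm{low}} = q_{\mathrm{illum}} \lor q_{\mathrm{blur}} \lor q_{\mathrm{contrast}}
= \max\left(q_{\mathrm{illum}},\, q_{\mathrm{blur}},\, q_{\mathrm{contrast}}\right).
\]
A sample is treated as normal quality only if \(q_{\mathrm{low}} = 0\), i.e., none of the three issues is annotated.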


\begin{table}[t]
\centering
\caption{Dataset statistics and failure attributes. For manipulated variants, three imbalance ratios are evaluated: 95/5, 90/10, and 80/20 (dominant/problematic). For real-world variants, ratios reflect natural distributions.}
\label{tab:dataset_details}
\resizebox{\textwidth}{!}{
% \scalebox{0.8}{
\begin{tabular}{@{}c l l l l@{}}
\toprule
\textbf{Case} & \textbf{Dataset} & \textbf{Size} & \textbf{Failure Attribute}& \textbf{Ratio (\%)} \\
\midrule
\multicolumn{4}{l}{\textit{Manipulated Failure Modes}} \\
\midrule
A & HC18 & 999 & Caliper/Non-caliper & \multirow{4}{*}{\{95/5, 90/10, 80/20\}}\\
B & ISIC2018 & 3673 & Central-cropped/Non-central-cropped & \\
C & ISIC2018 & 731 & Flood-fill/Polygon &\\
D & FIVES & 592 & Normal-quality/Low-quality &\\
\midrule
\multicolumn{4}{l}{\textit{Real-world Failure Modes}} \\
\midrule
\multirow{2}{*}{C} & \multirow{2}{*}{ISIC2018} & \multirow{2}{*}{3694} & \multirow{2}{*}{Flood-fill/Jagged/Polygon} & train/val/test: \\
&&&&\{32.9/23.8/43.3, 9/65/26, 14.8/63.9/21.3\} \\
D & FIVES & 800 & Normal-quality/Low-quality & train/test: \{76.5/23.5, 66.5/33.5\} \\
\bottomrule
\end{tabular}
}
\end{table}
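
As a rough guide to the manipulated ratios, and assuming (this is our reading, not stated in the table) that a ratio applies to a sample pool of size \(N\), a problematic fraction \(r\) corresponds to about
\[
N_{\mathrm{prob}} \approx r\,N, \qquad N_{\mathrm{dom}} \approx (1 - r)\,N
\]
samples in the problematic and dominant groups, respectively. For example, a hypothetical pool of \(N = 1000\) samples at the 95/5 ratio would contain about \(0.05 \times 1000 = 50\) problematic and \(950\) dominant samples.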

