\section{Evaluations and Visual Analysis}\label{sec:results}
The efficacy of our proposed framework was validated through extensive experiments on the CaDIS and DSAD datasets. The results, summarized in \tableref{tab:main} and \figureref{fig:results}, demonstrate the superior noise-robustness conferred by the abstention mechanism.
As anticipated, all loss functions exhibited a performance decline with increasing label noise, underscoring the universal challenge of learning from corrupted data. However, the critical distinction lies in the rate of this degradation. Our proposed abstaining loss functions degraded more gracefully and held an advantage over their respective non-abstaining baselines, particularly at high noise intensities. On the CaDIS dataset at 25\% noise, our Abstaining Dice Segmenter (ADS) emerged as the top performer, achieving a lead of 5.35 mIoU percentage points over the standard Dice loss. Similarly, GAC and SAC surpassed their baselines, confirming the broad applicability of our framework. This trend persisted on the more complex DSAD dataset: despite lower overall mIoU scores, the proposed abstaining variants (GAC, SAC, ADS) outperformed their baselines at every noise rate from 6\% upward, although the baselines remained competitive at 0\% and 3\% noise.
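For reference, the CaDIS margins at $\eta=25\%$ can be read directly off the means in \tableref{tab:main}; this is a simple difference of means that ignores the reported standard deviations:
\[
\text{ADS}-\text{Dice} = 66.39-61.04 = 5.35,\qquad
\text{SAC}-\text{SCE} = 61.27-55.08 = 6.19,\qquad
\text{GAC}-\text{GCE} = 59.46-55.71 = 3.75
\]
mIoU points.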
\begin{table}[htb]
    \renewcommand{\arraystretch}{1.25}
    \tableconts
    {tab:main}
    {\caption{Average test mIoU~(\%) and standard deviation across 5 runs of a U-Net model trained on CaDIS and DSAD datasets. We used the scalar noise rate $\boldsymbol{\tilde\eta}(\approx\eta)$ for IDAC, GAC, and SAC, and the class-wise noise vector $\boldsymbol{\tilde\eta_c}$ for ADS.
    \textbf{Gray background:} Abstaining loss functions. 
    \textbf{(*):} Our proposed novel loss functions. 
    \textbf{Structure:} The table is divided into four comparative groups (separated by double vertical lines); each group compares a baseline loss against its abstaining counterpart(s). 
    \textbf{Bold:} Indicates the best result 
    \textit{within that specific group}. For example, in the last group, we compare Dice vs. ADS to isolate the impact of our framework on the Dice loss.}}
    {
    \resizebox{\textwidth}{!}{
    \begin{tabulary}{\textwidth}{c|c|c>{\columncolor{Gray}}c>{\columncolor{Gray}}c||c>{\columncolor{Gray}}c||c>{\columncolor{Gray}}c||c>{\columncolor{Gray}}c}
    \hline
    \multirow{2}{*}{Dataset} & \multirow{2}{*}{\makecell{Noise rate \\ $\eta$~(\%)}} & \multicolumn{9}{c}{Loss function} \\
    & & CE & DAC & IDAC & GCE & GAC* & SCE & SAC* & Dice & ADS* \\
    \hline
    \multirow{6}{*}{CaDIS} & 0 & \textbf{76.02$\pm$0.70} & 75.29$\pm$0.79 & 75.36$\pm$0.73 & 73.49$\pm$3.27 & \textbf{73.76$\pm$2.80} & 75.38$\pm$0.75 & \textbf{75.83$\pm$0.62} & 76.52$\pm$0.47 & \textbf{77.04$\pm$0.37} \\
    & 5 & \textbf{73.67$\pm$1.03} & 73.14$\pm$0.46 & 72.89$\pm$0.41 & \textbf{72.83$\pm$1.11} & 71.73$\pm$2.79 & 73.41$\pm$0.71 & \textbf{73.51$\pm$1.59} & 73.48$\pm$0.28 & \textbf{75.22$\pm$0.85} \\
    & 10 & 66.39$\pm$0.17 & \textbf{67.43$\pm$0.49} & 66.92$\pm$0.49 & \textbf{64.82$\pm$0.86} & 64.16$\pm$2.57 & 65.92$\pm$0.91 & \textbf{67.29$\pm$1.65} & 66.51$\pm$0.61 & \textbf{71.12$\pm$0.55} \\
    & 15 & 64.15$\pm$2.47 & \textbf{65.85$\pm$1.05} & 64.87$\pm$0.91 & \textbf{64.81$\pm$0.46} & 64.44$\pm$2.70 & 62.16$\pm$1.99 & \textbf{65.48$\pm$2.11} & 67.31$\pm$0.73 & \textbf{70.80$\pm$1.08} \\
    & 20 & 59.56$\pm$1.21 & \textbf{63.42$\pm$0.87} & 60.54$\pm$2.27 & 60.73$\pm$1.41 & \textbf{60.91$\pm$1.64} & 57.62$\pm$4.22 & \textbf{62.70$\pm$0.31} & 63.64$\pm$0.82 & \textbf{68.88$\pm$0.49} \\
    & 25 & 52.27$\pm$1.70 & \textbf{60.63$\pm$2.73} & 58.19$\pm$4.77 & 55.71$\pm$1.30 & \textbf{59.46$\pm$0.76} & 55.08$\pm$0.93 & \textbf{61.27$\pm$1.22} & 61.04$\pm$1.41 & \textbf{66.39$\pm$0.67} \\
    \hline
    \hline
    \multirow{6}{*}{DSAD} & 0 & \textbf{34.25$\pm$2.50} & 34.01$\pm$0.96 & 33.60$\pm$0.72 & \textbf{35.14$\pm$1.65} & 32.26$\pm$0.53 & 32.78$\pm$1.19 & \textbf{33.86$\pm$1.83} & \textbf{31.28$\pm$0.87} & 30.09$\pm$1.10 \\
    & 3 & \textbf{33.69$\pm$1.85} & 33.67$\pm$2.01 & 32.76$\pm$2.03 & \textbf{33.84$\pm$2.56} & 32.94$\pm$2.23 & \textbf{32.11$\pm$1.09} & 30.90$\pm$2.76 & \textbf{30.83$\pm$4.78} & 28.64$\pm$2.76 \\
    & 6 & \textbf{30.70$\pm$2.47} & 29.47$\pm$1.97 & 29.11$\pm$2.10 & 29.69$\pm$1.96 & \textbf{29.78$\pm$4.27} & 30.51$\pm$2.16 & \textbf{31.55$\pm$2.43} & 28.56$\pm$1.00 & \textbf{30.48$\pm$3.61} \\
    & 9 & \textbf{24.65$\pm$2.90} & 24.58$\pm$2.61 & 23.47$\pm$2.48 & 22.95$\pm$2.93 & \textbf{28.84$\pm$4.17} & 28.02$\pm$2.37 & \textbf{28.55$\pm$1.29} & 19.04$\pm$1.92 & \textbf{26.23$\pm$2.05} \\
    & 12 & 21.00$\pm$3.15 & \textbf{22.59$\pm$4.35} & 20.94$\pm$1.86 & 19.84$\pm$2.89 & \textbf{25.00$\pm$4.13} & 21.57$\pm$0.67 & \textbf{23.73$\pm$0.68} & 16.15$\pm$1.49 & \textbf{22.63$\pm$0.51} \\
    & 15 & 14.41$\pm$2.59 & \textbf{17.69$\pm$3.97} & 16.24$\pm$1.45 & 14.12$\pm$2.91 & \textbf{20.01$\pm$2.56} & 15.31$\pm$0.75 & \textbf{15.91$\pm$3.53} & 14.65$\pm$1.50 & \textbf{18.05$\pm$1.63} \\
    \hline
    \end{tabulary}
    }
    }
\end{table}

\begin{figure}[htb]
    \figureconts
    {fig:results}
    {\caption{Quantitative comparison of noise-robustness. The plots show the average test mIoU~(\%) degradation as label noise $\eta$ increases. The flatter curves of the abstaining variants (solid lines) demonstrate their superior resilience compared to non-abstaining baselines (dashed lines). Our proposed losses (GAC, SAC, ADS) are in \textbf{bold}.}}
    {
    \subfigure[CaDIS Dataset]{
        \label{fig:results cadis}
        \includegraphics[width=0.47\textwidth]{figures/results_cadis_line.png}
    }
    \subfigure[DSAD Dataset]{
        \label{fig:results dsad}
        \includegraphics[width=0.47\textwidth]{figures/results_dsad_line.png}
    }
    }
\end{figure}

The flatter degradation curves for the abstaining methods in \figureref{fig:results} highlight their superior resilience. This observation is quantitatively substantiated in \tableref{tab:ci}, where we report the normalized performance drop rate ($\Delta\text{mIoU}/\Delta\eta$). Most notably, ADS demonstrated a statistically significant reduction in degradation compared to the Dice baseline on both datasets (CaDIS: $p<0.001$, DSAD: $p=0.003$). The CE-based variants showed dataset-dependent improvements: SAC yielded significant gains on CaDIS ($p=0.001$), while GAC proved significantly more robust on the challenging DSAD benchmark ($p=0.009$). This confirms that while the framework is effective, the optimal choice of base loss may depend on dataset characteristics.
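As a worked illustration, assume the drop rate is taken as the end-point difference in mean mIoU divided by the highest noise rate; this reading is an assumption, but it is consistent with the tabulated means. The CaDIS entries for Dice and ADS then follow from \tableref{tab:main} as
\[
\left.\frac{\Delta\text{mIoU}}{\Delta\eta}\right|_{\text{Dice}} = \frac{76.52-61.04}{25-0} \approx 0.619,
\qquad
\left.\frac{\Delta\text{mIoU}}{\Delta\eta}\right|_{\text{ADS}} = \frac{77.04-66.39}{25-0} = 0.426,
\]
so that, on this dataset, ADS loses roughly $1-0.426/0.619\approx 31\%$ fewer mIoU points per 1\% of added label noise than Dice.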

A qualitative review of the segmentation masks, depicted in \figureref{fig:cadis-vis,fig:dsad-vis}, further substantiates these quantitative gains. Visual inspection of the CaDIS results revealed that models trained with our abstaining losses produced markedly cleaner and more accurate segmentations than their baselines. Contours were sharper, noise artifacts were reduced, and overall structural coherence was improved. Most notably, ADS produced masks with the highest similarity to the ground truth even at the highest noise level. While the inherent difficulty of the DSAD dataset resulted in lower-quality predictions across all methods, the same relative improvements were observed. The abstaining models consistently generated more coherent masks with fewer spurious predictions and better-defined boundaries than their non-abstaining counterparts. This visual evidence suggests that the improvements in mIoU translate to more reliable and clinically relevant segmentation outputs.
\begin{table}[htb]
    \tableconts
    {tab:ci}
    {\caption{Quantitative analysis of robustness using the \textbf{Normalized Performance Drop Rate} ($\Delta\text{mIoU}/\Delta\eta$) across 5 runs on CaDIS and DSAD. Values represent the average mIoU points lost for every 1\% increase in label noise (Mean$\pm$95\% CI over 5 seeds). Lower values indicate greater resilience. 
    \textbf{Gray background:} Abstaining loss functions. 
    \textbf{(*):} Our proposed methods.
    \textbf{Structure:} The table is grouped to compare baseline losses against their abstaining counterparts.
    \textbf{($\boldsymbol{\dagger}$):} Statistically significant improvement over baseline ($p<0.05$, paired t-test).
    }}
    {
    \begin{tabular}{lcc}
        \toprule
        Loss Function & CaDIS & DSAD \\
        \midrule
        CE   & 0.950$\pm$0.099 & 1.323$\pm$0.379 \\
        \rowcolor{Gray} DAC  & 0.587$\pm$0.167$^{\dagger}$ & 1.088$\pm$0.346\\
        \rowcolor{Gray} IDAC & 0.687$\pm$0.255 & 1.157$\pm$0.149 \\
        \hline        
        GCE  & 0.711$\pm$0.140 & 1.401$\pm$0.166 \\
        \rowcolor{Gray} GAC* & 0.572$\pm$0.140& 0.817$\pm$0.197$^{\dagger}$\\
        \hline
        SCE  & 0.812$\pm$0.068 & 1.165$\pm$0.075\\
        \rowcolor{Gray} SAC* & 0.582$\pm$0.046$^{\dagger}$ & 1.197$\pm$0.202 \\
        \hline
        Dice & 0.619$\pm$0.079 & 1.108$\pm$0.154 \\
        \rowcolor{Gray} ADS* & 0.426$\pm$0.036$^{\dagger}$ & 0.803$\pm$0.082$^{\dagger}$\\
        \bottomrule
    \end{tabular}
    }
\end{table}

\begin{figure}[htb]
    \figureconts
    {fig:cadis-vis}
    {\caption{Qualitative comparison on a CaDIS sample at 25\% noise. Our proposed abstaining losses (\textbf{GAC}, \textbf{SAC}, \textbf{ADS}) produce masks with higher fidelity and fewer artifacts than their baselines. Abstaining losses are in \textbf{bold}.}}
    {
    \subfigure[Ground truth]{
        \label{fig:cadis-gt}
        \includegraphics[width=0.175\textwidth]{samples/cadis-gt.png}
    }
    \subfigure[CE]{
        \label{fig:cadis-ce}
        \includegraphics[width=0.175\textwidth]{samples/cadis-ce.png}
    }
    \subfigure[\bfseries DAC]{
        \label{fig:cadis-dac}
        \includegraphics[width=0.175\textwidth]{samples/cadis-dac.png}
    }
    \subfigure[\bfseries IDAC]{
        \label{fig:cadis-idac}
        \includegraphics[width=0.175\textwidth]{samples/cadis-idac.png}
    }
    \subfigure[GCE]{
        \label{fig:cadis-gce}
        \includegraphics[width=0.175\textwidth]{samples/cadis-gce.png}
    }
    \\
    \subfigure[\bfseries GAC]{
        \label{fig:cadis-gac}
        \includegraphics[width=0.175\textwidth]{samples/cadis-gac.png}
    }
    \subfigure[SCE]{
        \label{fig:cadis-sce}
        \includegraphics[width=0.175\textwidth]{samples/cadis-sce.png}
    }
    \subfigure[\bfseries SAC]{
        \label{fig:cadis-sac}
        \includegraphics[width=0.175\textwidth]{samples/cadis-sac.png}
    }
    \subfigure[Dice]{
        \label{fig:cadis-dice}
        \includegraphics[width=0.175\textwidth]{samples/cadis-dice.png}
    }
    \subfigure[\bfseries ADS]{
        \label{fig:cadis-ads}
        \includegraphics[width=0.175\textwidth]{samples/cadis-ads.png}
    }
    }
\end{figure}
\begin{figure}[htb]
    \figureconts
    {fig:dsad-vis}
    {\caption{Qualitative comparison on a challenging DSAD sample at 15\% noise. Abstaining variants (in \textbf{bold}) yield masks with better structural coherence and fewer spurious activations than their baselines.}}
    {
    \subfigure[Ground truth]{
        \label{fig:dsad-gt}
        \includegraphics[width=0.175\textwidth]{samples/dsad-gt.png}
    }
    \subfigure[CE]{
        \label{fig:dsad-ce}
        \includegraphics[width=0.175\textwidth]{samples/dsad-ce.png}
    } 
    \subfigure[\bfseries DAC]{
        \label{fig:dsad-dac}
        \includegraphics[width=0.175\textwidth]{samples/dsad-dac.png}
    }
    \subfigure[\bfseries IDAC]{
        \label{fig:dsad-idac}
        \includegraphics[width=0.175\textwidth]{samples/dsad-idac.png}
    }
    \subfigure[GCE]{
        \label{fig:dsad-gce}
        \includegraphics[width=0.175\textwidth]{samples/dsad-gce.png}
    } 
    \\
    \subfigure[\bfseries GAC]{
        \label{fig:dsad-gac}
        \includegraphics[width=0.175\textwidth]{samples/dsad-gac.png}
    }
    \subfigure[SCE]{
        \label{fig:dsad-sce}
        \includegraphics[width=0.175\textwidth]{samples/dsad-sce.png}
    }
    \subfigure[\bfseries SAC]{
        \label{fig:dsad-sac}
        \includegraphics[width=0.175\textwidth]{samples/dsad-sac.png}
    } 
    \subfigure[Dice]{
        \label{fig:dsad-dice}
        \includegraphics[width=0.175\textwidth]{samples/dsad-dice.png}
    }
    \subfigure[\bfseries ADS]{
        \label{fig:dsad-ads}
        \includegraphics[width=0.175\textwidth]{samples/dsad-ads.png}
    }
    }
\end{figure}

\pagebreak
\subsection{Sensitivity to Prior Noise Rate Estimation}\label{sec:eta}
A practical concern for deployment is the potential unavailability of an accurate noise rate prior, $\tilde{\eta}$. To evaluate the sensitivity of our framework to this hyperparameter, we conducted an ablation study on both CaDIS (at true $\eta=25\%$) and DSAD (at true $\eta=15\%$), where we deliberately mis-specified the estimated $\tilde{\eta}$. For this analysis, all three losses (GAC, SAC, and ADS) utilized a single \textbf{scalar} global noise estimate to ensure a consistent comparison.

\tableref{tab:eta} presents the results, comparing the ``Oracle'' setting ($\tilde{\eta} \approx \eta$) against under-estimation and over-estimation. SAC displayed remarkable stability, with performance remaining virtually unchanged across all priors. GAC and ADS showed a dependency on the prior, yet even in the extreme case of an uninformed prior ($\tilde{\eta}=0$, effectively reverting to a DAC-style regularizer with our power-law schedule), both methods outperformed their non-abstaining baselines.

Notably, both the Oracle-setting ADS performance in this scalar sweep ($63.07\%$ on CaDIS) and the best value in the sweep ($63.90\%$ at $\tilde{\eta}=50\%$) are lower than the main result reported in \tableref{tab:main} ($66.39\%$), which used class-specific noise vectors $\boldsymbol{\tilde\eta_c}$. This discrepancy highlights the importance of the class-wise formulation. As shown in \appendixref{app:class_noise}, both datasets contain extreme inter-class noise variance (ranging from $9.7\%$ to $91.1\%$ for CaDIS, and from $5.7\%$ to $94.6\%$ for DSAD) at their highest noise levels. Using a single scalar prior (e.g., $25\%$) inevitably under-estimates the noise for difficult classes, hampering the model's ability to abstain effectively on them. However, the results confirm that even without this granular information, ADS remains effective and robust to the scalar estimate itself.
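To make the comparison concrete, the CaDIS means at $\eta=25\%$ from \tableref{tab:main} and \tableref{tab:eta} give
\[
\underbrace{66.39-63.07}_{\text{class-wise vs.\ scalar prior}} = 3.32,
\qquad
\underbrace{63.07-61.04}_{\text{scalar-prior ADS vs.\ Dice}} = 2.03
\]
mIoU points. Under this simple difference-of-means reading, which does not account for seed-to-seed variance, the two terms sum to the 5.35-point lead of class-wise ADS over Dice, with the scalar prior retaining part of that gain.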

\begin{table}[htb]
    \tableconts
    {tab:eta}
    {\caption{Sensitivity analysis of GAC, SAC, and ADS under varying scalar estimated noise rates $\tilde{\eta}$. The ``Oracle'' settings are underlined. While SAC is nearly invariant to $\tilde{\eta}$, GAC and ADS show moderate sensitivity but remain robust even when $\tilde{\eta}=0$. Note that for this experiment we employed the global scalar noise rate $\tilde\eta$ for ADS, unlike \tableref{tab:main} where we used $\tilde\eta_c$ to showcase ADS's capabilities under optimal conditions.}}
    {
    \begin{tabulary}{\textwidth}{c|c|ccc}
    \hline
    \multirow{2}{*}{Dataset} & \multirow{2}{*}{$\tilde\eta$~(\%)} & \multicolumn{3}{c}{Loss function} \\
    & & GAC & SAC & ADS \\
    \hline
    \multirow{5}{*}{CaDIS} & 0 & 57.34$\pm$1.12 & 61.27$\pm$1.23 & 61.53$\pm$1.41 \\
    & 15 & 59.35$\pm$0.78 & 61.27$\pm$1.22 & 62.55$\pm$1.00 \\
    & \underline{25 (Oracle)} & 59.46$\pm$0.76 & 61.27$\pm$1.22 & 63.07$\pm$0.96 \\
    & 35 & 59.47$\pm$0.73 & 61.27$\pm$1.22 & 63.51$\pm$0.92 \\
    & 50 & 59.77$\pm$0.99 & 61.28$\pm$1.22 & 63.90$\pm$1.12 \\
    \hline
    \multirow{5}{*}{DSAD} & 0 & 20.30$\pm$2.86 & 15.87$\pm$3.46 & 18.29$\pm$2.14 \\
    & 10 & 20.11$\pm$2.66 & 15.87$\pm$3.47 & 17.94$\pm$1.95 \\
    & \underline{15 (Oracle)} & 20.01$\pm$2.56 & 15.91$\pm$3.53 & 18.10$\pm$2.14 \\
    & 20 & 20.18$\pm$2.71 & 15.81$\pm$3.39 & 17.71$\pm$1.55 \\
    & 30 & 19.02$\pm$3.22 & 15.91$\pm$3.54 & 18.03$\pm$1.45 \\
    \hline
    \end{tabulary}
    }
\end{table}