\appendix
\section{DAC $\alpha$ auto-tuning algorithm} \label{app:alpha}
For reference and comparison, \algorithmref{alg:alpha} outlines the original linear auto-tuning schedule proposed by \citet{thulasidasan2019combating}. This approach requires a stateful, iterative update process: a moving average $\tilde{\beta}$ of the loss is tracked during the warm-up phase (Lines 3--9), $\alpha$ is initialized from it at the start of epoch $L$ (Lines 10--15), and $\alpha$ is then incremented by a fixed $\delta_\alpha$ once per epoch (Lines 16--19). In contrast, our proposed framework (\sectionref{sec:method}) replaces this iterative logic with the direct power-law formulation in \equationref{eq:alpha}. This eliminates the need for state tracking and intermediate variable initialization, while offering greater flexibility in the curriculum schedule.

\begin{algorithm2e}
\LinesNumbered
\caption{$\alpha$ auto-tuning}
\label{alg:alpha}
\KwIn{total iter. ($T$), current iter. ($t$), total epochs ($E$), abstention-free epochs ($L$), current epoch ($e$), $\alpha$ init factor ($\rho$), final $\alpha$ ($\alpha_{final}$), moving-average coefficient ($\mu$), mini-batch cross-entropy over true classes ($\mathcal{H}_c(P^M_{1\dots K})$)}
$\alpha_{set} = \text{False}$\;
\For{$t := 0$ \KwTo $T$}{
    \If{$e < L$}{
        $\beta = (1-P^M_{K+1})\mathcal{H}_c(P^M_{1\dots K})$\;
        \If{$t = 0$}{
            $\tilde{\beta} = \beta$ \tcp*[r]{\{initialize moving average \}}
        }
        $\tilde{\beta} \leftarrow (1-\mu)\tilde{\beta} + \mu\beta$\;
    }
    \If{$e = L$ \textbf{and not} $\alpha_{set}$}{
        $\alpha := \tilde{\beta}/\rho$ \tcp*[r]{\{initialize $\alpha$ at start of epoch $L$ \}}
        $\delta_{\alpha} := \frac{\alpha_{final}-\alpha}{E-L}$\;
        $update_{epoch} = L$\;
        $\alpha_{set} = \text{True}$\;
    }
    \If{$e > update_{epoch}$}{
        $\alpha \leftarrow \alpha + \delta_{\alpha}$ \tcp*[r]{\{then update $\alpha$ once every epoch \}}
        $update_{epoch} = e$\;
    }
}
\end{algorithm2e}
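
As a worked illustration of the linear schedule in \algorithmref{alg:alpha}, consider $E=50$ (\tableref{tab:train_params}) with $L=10$ and $\alpha_{final}=1$ (the DAC setting for CaDIS in \tableref{tab:loss_params}). The initial value $\alpha_L = \tilde{\beta}/\rho$ depends on the training run; the symbol $\alpha_L$ is introduced here for convenience, and the value $\alpha_L = 0.2$ is purely hypothetical, chosen only to make the arithmetic concrete.
\begin{equation*}
\delta_\alpha = \frac{\alpha_{final}-\alpha_L}{E-L} = \frac{1-0.2}{50-10} = 0.02, \qquad \alpha_e = \alpha_L + (e-L)\,\delta_\alpha \quad \text{for } e \geq L.
\end{equation*}
Under this assumption, $\alpha$ would reach $0.2 + 20 \cdot 0.02 = 0.6$ at epoch $e=30$ and $\alpha_{final}=1$ at $e=E$.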

\pagebreak
\section{Abstention Dynamics during Training}
To better understand how different loss functions utilize the abstention mechanism throughout the training process, we visualize the batch-wise abstention rate over time. \figureref{fig:abstention_dynamics} depicts the training trajectory for DAC, IDAC, and our proposed GAC on the CaDIS dataset with a synthetic noise rate of $\eta=15\%$. The plot reveals distinct behaviours after the warm-up phase. The original DAC (blue) exhibits a rapid collapse in abstention after an initial spike: the penalty effectively forces the abstention rate to zero, meaning the model stops utilizing the mechanism and risks overfitting to noisy labels. IDAC (orange) avoids collapsing to zero but exhibits high variance. In contrast, our proposed GAC (green) demonstrates a controlled and graceful descent, eventually stabilizing at an abstention rate of approximately 15\%.

\begin{figure}[htb]
    \floatconts
    {fig:abstention_dynamics}
    {\caption{Evolution of the abstention rate during training on the CaDIS dataset with 15\% label noise.}}
    {\includegraphics[width=0.8\textwidth]{figures/abstention15.png}}
\end{figure}

\subsection{Alpha Auto-Tuning Behaviour}\label{app:gamma}
As described in \sectionref{sec:gamma}, our framework utilizes a power-law-based auto-tuning algorithm for the abstention penalty $\alpha$. \figureref{fig:gamma} visually demonstrates the effect of the growth factor $\gamma$ on the trajectory of $\alpha$ throughout the training process, enabling sublinear ($\gamma<1$), linear ($\gamma=1$), or superlinear ($\gamma>1$) growth.
\begin{figure}[htb]
    \floatconts
    {fig:gamma}
    {\caption{The effect of different values of $\gamma$ on the growth of $\alpha$ with $\alpha_{final}=1$.}}
    {\includegraphics[width=0.59\linewidth]{figures/gamma.png}}
\end{figure}
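
To make the role of $\gamma$ concrete, the following sketch assumes a normalized power-law schedule of the form $\alpha(u) = \alpha_{final}\,u^{\gamma}$, where $u \in [0,1]$ denotes the fraction of the post-warm-up training that has elapsed. Both this form and the symbol $u$ are assumptions made for illustration only; the exact schedule is the one given in \equationref{eq:alpha}. With $\alpha_{final}=1$, halfway through the schedule ($u=0.5$) this form would give
\begin{equation*}
0.5^{0.5} \approx 0.71, \qquad 0.5^{1} = 0.5, \qquad 0.5^{2} = 0.25, \qquad 0.5^{3} = 0.125
\end{equation*}
for $\gamma = 0.5$, $1$, $2$, and $3$, respectively. Under this assumed form, a small $\gamma$ applies most of the penalty early, whereas a large $\gamma$ defers most of the growth to the final epochs.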

\section{Additional Experimental Details and Method Parameters}\label{sec:appendix}
This appendix provides supplementary details regarding our experimental setup and the parameters of our proposed abstention framework.

\subsection{Global Training Hyperparameters}\label{app:train_params}
All experiments in the main paper were conducted using a consistent set of global training parameters to ensure a fair comparison. These parameters, including the network architecture, optimizer, and learning rate schedule, are detailed in \tableref{tab:train_params}.
\begin{table}[htb]
    \floatconts
    {tab:train_params}
    {\caption{Training hyperparameter configurations used in our experiments.}}
    {\begin{tabular}{@{}ll@{}}
        \toprule
        \textbf{Parameter} & \textbf{Value} \\ 
        \midrule
        Architecture & U-Net \\
        Backbone & Pretrained ResNet-50 \\
        Optimizer & AdamW \\
        Epochs & 50 \\
        Initial Learning Rate & 0.003 \\
        LR Schedule & Step decay; factor of 0.2 every 10 epochs \\
        Batch Size (CaDIS) & 128 \\
        Batch Size (DSAD) & 50 \\
        Seed Runs & 5 \\
        \bottomrule
    \end{tabular}}
\end{table}
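
For concreteness, the step-decay schedule in \tableref{tab:train_params} can be written out explicitly. The sketch below assumes zero-indexed epochs $e \in \{0,\dots,49\}$ and that the decay is applied at the start of every tenth epoch; both conventions are assumptions, as they are not stated in the table.
\begin{equation*}
\mathrm{lr}(e) = 0.003 \times 0.2^{\lfloor e/10 \rfloor}
\end{equation*}
Under these assumptions, the learning rate would take the values $3\times10^{-3}$, $6\times10^{-4}$, $1.2\times10^{-4}$, $2.4\times10^{-5}$, and $4.8\times10^{-6}$ over the five consecutive ten-epoch blocks.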
\vfill
\subsection{Loss Function-Specific Hyperparameters}\label{app:loss_params}
The hyperparameters for each loss function were selected through a two-stage optimization process on the validation set, using the highest noise level for each dataset ($\eta=25\%$ for CaDIS, $\eta=15\%$ for DSAD) to prioritize robustness. First, we conducted a Bayesian search using Weights \& Biases to identify high-performing ranges for each parameter. Second, based on these ranges, we performed a fine-grained grid search over a set of interpretable, discrete values (e.g., stepping $\alpha_{final}$ by $0.5$ or $1.0$) to select the final configuration. This approach favours hyperparameters that are both effective and generalizable, and avoids overfitting to the specific floating-point values found during the Bayesian search. The final parameters used to generate the results in our paper are listed in \tableref{tab:loss_params}.
\begin{table}[htb]
    \floatconts
    {tab:loss_params}
    {\caption{The hyperparameter configurations for each loss function. $L$ is the number of warm-up epochs, and $\alpha_{final}$ is the target penalty for DAC, GAC, SAC, and ADS. $\gamma$ is the growth factor for our enhanced $\alpha$ auto-tuning algorithm, and $s$ is the pooling output size for the class-wise abstention module in ADS. Note that $\alpha$ denotes the fixed abstention penalty for IDAC but the cross-entropy coefficient for SCE.}}
    {
    \begin{tabulary}{\textwidth}{c|c|c|c|c|c|c|c}
        \hline
        Dataset & DAC & IDAC & GCE & GAC & SCE & SAC & ADS \\
        \hline
        CaDIS 
        & \makecell{$\alpha_{final}=1$\\$L=10$} 
        & \makecell{$\alpha=1$\\$L=10$} 
        & $q=0.5$ 
        & \makecell{$\alpha_{final}=3$\\$L=10$\\$\gamma=3$} 
        & \makecell{$\alpha=1$\\$\beta=1$} 
        & \makecell{$\alpha_{final}=1$\\$L=10$\\$\gamma=1.5$} 
        & \makecell{$\alpha_{final}=1$\\$L=10$\\$\gamma=3$\\$s=16$} \\
        \hline\hline
        DSAD 
        & \makecell{$\alpha_{final}=2$\\$L=18$} 
        & \makecell{$\alpha=1$\\$L=10$} 
        & $q=0.1$ 
        & \makecell{$\alpha_{final}=2$\\$L=15$\\$\gamma=2$} 
        & \makecell{$\alpha=0.5$\\$\beta=1$} 
        & \makecell{$\alpha_{final}=1$\\$L=20$\\$\gamma=3$} 
        & \makecell{$\alpha_{final}=4$\\$L=10$\\$\gamma=1.5$\\$s=16$} \\
        \hline
    \end{tabulary}
    }
\end{table}

\section{Visualization of Synthetic Noise}\label{app:noise}
To provide a visual assessment of the difficulty and realism of our synthetic noise injection protocol, we present qualitative examples from both the CaDIS and DSAD datasets in \figureref{fig:cadis-noise} and \figureref{fig:dsad-noise}, respectively. The visualizations display the progression of corruption alongside \textbf{Difference Maps}, where white pixels indicate disagreement between the ground truth and the noisy mask. These maps clearly highlight that our noise generation strategy creates not only random semantic errors (large flipped regions) but also challenging structural artifacts along object boundaries, mimicking the inter-rater variability often seen in clinical annotations.
\begin{figure}[htbp]
    \figureconts
    {fig:cadis-noise}
    {\caption{Visualization of synthetic label noise on a sample CaDIS frame. The figure illustrates the progression of corruption from low ($\eta=5\%$) to severe ($\eta=25\%$) noise levels. (a)--(b) show the original surgical view and the clean ground truth. (c)--(h) display the \textbf{Noisy Overlays} (transparent mask on image) and the corresponding \textbf{Difference Maps} at 5\%, 15\%, and 25\% noise. In the Difference Maps, \textbf{Black} indicates agreement with the ground truth, while \textbf{White} indicates corrupted pixels.}}
    {
    \subfigure[Original image]{
        \label{fig:cadis-noise-og}
        \includegraphics[width=0.225\textwidth]{samples/cadis_noise_orig.png}
    }
    \subfigure[Ground truth]{
        \label{fig:cadis-gt}
        \includegraphics[width=0.225\textwidth]{samples/cadis_noise_gt.png}
    }
    \subfigure[5\% Overlay]{
        \label{fig:cadis-overlay1}
        \includegraphics[width=0.225\textwidth]{samples/cadis_overlay1.png}
    }
    \subfigure[5\% Diff Map]{
        \label{fig:cadis-diff1}
        \includegraphics[width=0.225\textwidth]{samples/cadis_diff1.png}
    }
    \\
    \subfigure[15\% Overlay]{
        \label{fig:cadis-overlay3}
        \includegraphics[width=0.225\textwidth]{samples/cadis_overlay3.png}
    }
    \subfigure[15\% Diff Map]{
        \label{fig:cadis-diff3}
        \includegraphics[width=0.225\textwidth]{samples/cadis_diff3.png}
    }
    \subfigure[25\% Overlay]{
        \label{fig:cadis-overlay5}
        \includegraphics[width=0.225\textwidth]{samples/cadis_overlay5.png}
    }
    \subfigure[25\% Diff Map]{
        \label{fig:cadis-diff5}
        \includegraphics[width=0.225\textwidth]{samples/cadis_diff5.png}
    }
    }
\end{figure}

\begin{figure}[htbp]
    \figureconts
    {fig:dsad-noise}
    {\caption{Visualization of synthetic label noise on a sample DSAD frame. This dataset presents a challenging scenario with sparse annotations and complex anatomy. (a)--(b) show the raw laparoscopic image and the ground truth. (c)--(h) illustrate the impact of noise at the 3\%, 9\%, and 15\% levels. The \textbf{Difference Maps} (White = Error) highlight that even at lower noise rates ($\eta=3\%$), significant structural corruption is introduced along the organ boundaries. At higher rates ($\eta=15\%$), the corruption creates large, misleading semantic regions that severely test model robustness.}}
    {
    \subfigure[Original image]{
        \label{fig:dsad-noise-og}
        \includegraphics[width=0.225\textwidth]{samples/dsad_noise_orig.png}
    }
    \subfigure[Ground truth]{
        \label{fig:dsad-gt}
        \includegraphics[width=0.225\textwidth]{samples/dsad_noise_gt.png}
    }
    \subfigure[3\% Overlay]{
        \label{fig:dsad-overlay1}
        \includegraphics[width=0.225\textwidth]{samples/dsad_overlay1.png}
    }
    \subfigure[3\% Diff Map]{
        \label{fig:dsad-diff1}
        \includegraphics[width=0.225\textwidth]{samples/dsad_diff1.png}
    }
    \\
    \subfigure[9\% Overlay]{
        \label{fig:dsad-overlay3}
        \includegraphics[width=0.225\textwidth]{samples/dsad_overlay3.png}
    }
    \subfigure[9\% Diff Map]{
        \label{fig:dsad-diff3}
        \includegraphics[width=0.225\textwidth]{samples/dsad_diff3.png}
    }
    \subfigure[15\% Overlay]{
        \label{fig:dsad-overlay5}
        \includegraphics[width=0.225\textwidth]{samples/dsad_overlay5.png}
    }
    \subfigure[15\% Diff Map]{
        \label{fig:dsad-diff5}
        \includegraphics[width=0.225\textwidth]{samples/dsad_diff5.png}
    }
    }
\end{figure}

\pagebreak
\subsection{Class-wise Noise Distribution}\label{app:class_noise}
To better understand the performance gap between the scalar-prior ADS (\tableref{tab:eta}) and the vector-prior ADS (\tableref{tab:main}), we provide the calculated class-wise noise rates used for the vector experiments.
\begin{itemize}
    \item CaDIS (at global $\eta=25\%$): $[68.2\%, 19.6\%, 36.6\%, 80.5\%, 91.1\%, 41.4\%, 23.7\%, 9.7\%]$.
    \item DSAD (at global $\eta=15\%$): $[5.7\%, 33.7\%, 66.7\%, 59.5\%, 94.6\%, 90.9\%, 92.1\%, 45.5\%]$.
\end{itemize}
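The noise rate for a specific class $c$ is defined as the fraction of pixels belonging to class $c$ in the ground truth that are corrupted in the noisy mask, where $y_i$ and $\hat{y}_i$ denote the ground-truth and noisy labels of pixel $i$, respectively: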
\begin{equation*}
\tilde{\eta}_c = \frac{\sum_{i} \mathds{1}(\hat{y}_i \neq c \land y_i = c)}{\sum_{i} \mathds{1}(y_i = c)} 
\end{equation*}
The high variance in these values demonstrates why a single scalar prior (e.g., 25\%) can be suboptimal for ADS, as it drastically underestimates the corruption for certain classes (e.g., 91.1\%), limiting the model's ability to abstain where it is needed most.
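
The gap between the global and class-wise rates can be made explicit. Assume that the global rate $\eta$ is measured as the fraction of all labeled pixels that are corrupted (an assumption, as the definition is not restated in this appendix). Writing $y_i$ and $\hat{y}_i$ for the ground-truth and noisy labels of pixel $i$, and introducing $n_c = \sum_i \mathds{1}(y_i = c)$ and $N = \sum_c n_c$ as illustrative notation, the global rate is then the pixel-weighted average of the class-wise rates:
\begin{equation*}
\eta = \frac{1}{N}\sum_i \mathds{1}(\hat{y}_i \neq y_i) = \sum_c \frac{n_c}{N}\,\tilde{\eta}_c .
\end{equation*}
Under this assumption, classes that occupy few pixels can have a very high $\tilde{\eta}_c$ while contributing little to $\eta$. In a hypothetical two-class example with $n_1/N = 0.9$, $\tilde{\eta}_1 = 0.18$ and $n_2/N = 0.1$, $\tilde{\eta}_2 = 0.88$, the global rate is $0.9 \cdot 0.18 + 0.1 \cdot 0.88 = 0.25$, even though the second class is almost entirely corrupted.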