\section{The Universal Abstention Framework}\label{sec:method}
Building on the demonstrated efficacy of DAC and IDAC in mitigating label noise through abstention, we propose a universal definition of the abstention mechanism that can be wrapped around virtually any underlying loss function $\mathcal{L}_X(x_j)$. Our generalized abstaining loss is formulated as:
\begin{equation}\label{eq:abstention}
    \mathcal{L}_{abstention}(x_j) = (1-p_{k+1})\mathcal{L}_{X}(x_j) + \alpha\left|\log\frac{1-\tilde\eta}{1-p_{k+1}}\right|
\end{equation}
With this formulation, we introduce two critical innovations designed to provide greater flexibility and more targeted noise mitigation.
\subsection{Informed Regularization}
The first improvement lies in the \textbf{regularization term} $\alpha\left|\log\frac{1-\tilde\eta}{1-p_{k+1}}\right|$. This term draws inspiration from IDAC by explicitly incorporating the expected noise rate $\tilde\eta$ to guide the abstention behaviour. Unlike DAC, which pushes the abstention probability $p_{k+1}$ toward zero, our term penalizes deviations of $p_{k+1}$ from $\tilde\eta$ in either direction and thus keeps $p_{k+1}$ close to $\tilde\eta$. The model can therefore continue abstaining on samples it confidently perceives as noisy, rather than being forced into classification decisions that raise the risk of overfitting to noise. The term is also flexible: if a reliable estimate of $\tilde\eta$ is not available, setting $\tilde\eta=0$ reduces it to the original DAC form, which has already proven effective against label noise.
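
As a brief check of these two properties, using only the symbols of \equationref{eq:abstention} and the fact that $0\le p_{k+1}<1$ implies $\log\frac{1}{1-p_{k+1}}\ge 0$, the term evaluates to
\[
    p_{k+1}=\tilde\eta \;\Longrightarrow\; \alpha\left|\log\frac{1-\tilde\eta}{1-\tilde\eta}\right| = 0,
    \qquad
    \tilde\eta=0 \;\Longrightarrow\; \alpha\left|\log\frac{1}{1-p_{k+1}}\right| = -\alpha\log\left(1-p_{k+1}\right).
\]
The first identity shows that the penalty attains its minimum of zero exactly at $p_{k+1}=\tilde\eta$. The second is the logarithmic penalty on abstention that we take to be the DAC form referred to above, so the absolute value has no effect in that special case.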

\subsection{Power-Law Auto-Tuning}\label{sec:gamma}
The second and more significant enhancement concerns the \textbf{$\boldsymbol{\alpha}$ auto-tuning algorithm}. The original DAC algorithm ramps $\alpha$ up linearly after a warm-up phase, which, while effective, offers limited control over the learning trajectory. We replace it with a power-law schedule that is equally simple to implement yet more flexible. For every epoch $e$ after an initial warm-up phase of $L$ epochs out of a total of $E$ epochs, $\alpha$ is computed as:
\begin{equation}\label{eq:alpha}
    \alpha = \alpha_{final} \cdot \left(\frac{e-L}{E-L}\right)^{\gamma}
\end{equation}
In this equation, $\gamma>0$ serves as a growth factor that controls the rate at which $\alpha$ increases throughout the abstention phase, as depicted in \appendixref{app:gamma}. If $\gamma>1$, $\alpha$ grows superlinearly (a convex schedule): it increases slowly at the beginning of the abstention period and accelerates towards the end of training. This behaviour intensifies with larger values of $\gamma$. Conversely, if $\gamma<1$, $\alpha$ grows sublinearly (a concave schedule): it rises quickly early in the abstention phase, and its rate of increase slows down as training progresses. Setting $\gamma=1$ yields a linear increment, akin to DAC's approach. This formulation provides finer control over how strongly abstention is penalized at each stage of training, enabling a better balance between learning from clean data and abstaining on noisy or ambiguous samples.
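
To make the effect of $\gamma$ concrete, consider a purely illustrative setting; the values $E=100$, $L=20$, and $\alpha_{final}=1$ are assumptions chosen for arithmetic convenience and are not taken from the experiments. At the midpoint of the abstention phase, $e=60$, the normalized progress is $(e-L)/(E-L)=40/80=0.5$, and \equationref{eq:alpha} gives
\[
    \alpha\big|_{\gamma=0.5} = 0.5^{0.5}\approx 0.71,
    \qquad
    \alpha\big|_{\gamma=1} = 0.5,
    \qquad
    \alpha\big|_{\gamma=2} = 0.5^{2} = 0.25 .
\]
All three schedules start from $\alpha=0$ at $e=L$ and reach $\alpha_{final}$ at $e=E$, so $\gamma$ changes only how the penalty is distributed between these two endpoints.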

\subsection{Novel Abstaining Loss Functions for Segmentation}
We demonstrate our framework's versatility by creating three novel, noise-robust loss functions, one of which is tailored for segmentation.

\subsubsection{Abstaining Classifiers (GAC and SAC)} 
We first integrate our framework with two CE-based losses. The \textbf{G}eneralized \textbf{A}bstaining \textbf{C}lassifier (GAC) combines abstention with Generalized Cross Entropy (GCE) \cite{zhang2018generalized}, creating a dual defence: GCE's bounded loss attenuates noise on classified samples, while abstention filters out the most corrupted ones. The \textbf{S}ymmetric \textbf{A}bstaining \textbf{C}lassifier (SAC) extends Symmetric Cross Entropy (SCE) \cite{wang2019symmetric} by allowing the model to disengage completely from highly suspect samples, rather than merely re-balancing their influence. With the most egregious noisy examples filtered out, the symmetric CE and RCE components can focus on refining predictions for the more reliable data.
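
As an illustration of how \equationref{eq:abstention} is instantiated, take $\mathcal{L}_X$ to be GCE, which for a sample with labelled class $y_j$ is $(1-p_{y_j}^{\,q})/q$ with $q\in(0,1]$ \cite{zhang2018generalized}. The following is a sketch only: the symbols $y_j$ and $q$ are introduced here for illustration, $p_{y_j}$ is assumed to denote the predicted probability of the labelled class, and whether $p_{y_j}$ is renormalized over the $k$ non-abstention classes is left unspecified.
\[
    \mathcal{L}_{GAC}(x_j) = (1-p_{k+1})\,\frac{1-p_{y_j}^{\,q}}{q} + \alpha\left|\log\frac{1-\tilde\eta}{1-p_{k+1}}\right|
\]
Because the GCE term is bounded above by $1/q$, the contribution of any single mislabelled sample to the first summand is limited even before abstention scales it down by $(1-p_{k+1})$.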

\subsubsection{Abstaining Dice Segmenter (ADS)}
Our most significant adaptation is the \textbf{A}bstaining \textbf{D}ice \textbf{S}egmenter (ADS)\footnote{The \textit{Segmenter} in ADS highlights its design for segmentation tasks, in contrast with the other `Classifier' losses, which can also be used for classification.}, which integrates our framework with the region-based Dice loss \cite{milletari2016v}. This required two fundamental changes to resolve the incompatibility between Dice's class-wise nature and standard pixel-wise abstention:
\begin{itemize}
    \item \textbf{Class-wise Abstention Head:} We re-conceptualized the network's output to produce class-wise abstention predictions. As illustrated in \figureref{fig:output_c}, a specialized module uses Adaptive Average Pooling with an output size $s\times s$, followed by a Linear layer and \textit{sigmoid} activation to output a unique abstention probability for each of the $k$ classes.
    \item \textbf{Class-specific Regularization:} To complement the class-wise abstention, the regularization term in \equationref{eq:abstention} is formulated to accept a vector of class-specific noise estimates, $\boldsymbol{\tilde\eta_c}$ (calculation detailed in \appendixref{app:class_noise}). This enables granular control over abstention behaviour per anatomical structure. The design also preserves flexibility: if detailed class-wise statistics are infeasible to obtain, ADS accepts a single global noise estimate $\tilde\eta$ applied uniformly across all classes. However, as we demonstrate in Section~\ref{sec:eta} and \appendixref{app:class_noise}, this scalar simplification can be suboptimal given the high inter-class noise variance typical in segmentation. In such scenarios, a better strategy is to employ methods that dynamically estimate noise rates from the data, such as Confident Learning \cite{northcutt2021confident}, Beta Mixture Models \cite{arazo2019unsupervised}, or Transition Matrix Estimation \cite{xia2019anchor}.
\end{itemize}
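The two changes above can be summarized in a hedged sketch. The flattening step and the symbols $a_c$ and $\mathcal{L}_{Dice}^{(c)}$ are illustrative assumptions rather than notation defined above. Assuming the pooled features are flattened before the Linear layer, the abstention head maps tensor shapes as
\[
    (b, c, h, w)
    \;\stackrel{\mathrm{pool}}{\longrightarrow}\;
    (b, c, s, s)
    \;\stackrel{\mathrm{flatten}}{\longrightarrow}\;
    (b, c\,s^{2})
    \;\stackrel{\mathrm{linear,\ sigmoid}}{\longrightarrow}\;
    (b, k),
\]
yielding one abstention probability $a_c\in(0,1)$ per class. Writing $\mathcal{L}_{Dice}^{(c)}$ for the Dice loss of class $c$ and $\tilde\eta_c$ for its noise estimate, one plausible class-wise reading of \equationref{eq:abstention}, here averaged over the $k$ classes, is
\[
    \frac{1}{k}\sum_{c=1}^{k}\left[(1-a_c)\,\mathcal{L}_{Dice}^{(c)} + \alpha\left|\log\frac{1-\tilde\eta_c}{1-a_c}\right|\right],
\]
which recovers the global-estimate variant when $\tilde\eta_c=\tilde\eta$ for every class $c$.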
\begin{figure}[htb]
    \floatconts
    {fig:ads_output}
    {\caption{Transforming the output layer from standard pixel-wise abstention (a) to our proposed class-wise abstention head for \textbf{ADS} (b). The dimensions $b, c, h, w$ represent batch size, number of classes or channels, height, and width, respectively. $(s,s)$ is the output size for the Adaptive Average Pool layer.}}
    {
    \subfigure[Pixel-wise abstention]{
        \label{fig:output_a}
        \includegraphics[width=0.4\textwidth]{figures/dac.pdf}
    }
    \subfigure[Class-wise abstention]{
        \label{fig:output_c}
        \includegraphics[width=0.55\textwidth]{figures/ads.pdf}
    }
}
\end{figure}