\section{Related Work} \label{sec:related}
The deployment of deep learning in medical imaging is frequently hampered by data imperfections, ranging from data scarcity and class imbalance \cite{tomar2025first, tomar2025effective} to missing or uncertain annotations \cite{schneider2023spml}. Within this landscape, label noise remains a particularly pervasive challenge that has been explored through several avenues \cite{karimiDeepLearningNoisy2020}. 

Early strategies operated at different granularities, seeking to identify and correct noisy labels. These included pixel-wise adaptive weight maps, as proposed by \citet{shiDistillingEffectiveSupervision2021}, which dynamically adjust the contribution of each pixel to the loss, and graph-based label correction by \citet{yiLearningPixelLevelLabel2022}, which models spatial relationships to propagate corrections from reliable to unreliable pixels. Other frameworks operated at the image level to assess overall annotation quality, with some combining both pixel- and image-level perspectives to distill supervision more effectively \cite{shiDistillingEffectiveSupervision2021,zhuPickandLearnAutomaticQuality2019}. Recognizing that segmentation noise is often not random but spatially correlated, other works have proposed explicit noise models. The Markov models from \citet{yaoLearningSegmentNoisy2023}, for instance, simulate realistic boundary distortions, while methods like LVC-Net by \citet{shuLVCNetMedicalImage2019} leverage local visual cues to guide the network away from incorrect labels during training.

Another line of work exploits the intrinsic learning dynamics of deep networks, particularly the `early-learning' phenomenon, in which models fit clean, simple patterns before eventually memorizing incorrect labels \cite{liuAdaptiveEarlyLearningCorrection2022,yeActiveNegativeLoss2024}. This observation has led to adaptive correction methods such as ADELE by \citet{liuAdaptiveEarlyLearningCorrection2022}, which detects the onset of memorization for each semantic class in order to intervene before the noise is fit. In a similar vein, multi-network and co-training paradigms leverage the consensus or disagreement between two or more models to filter out noisy signals. By using diverse architectures, these methods reduce the risk of confirmation bias, where a single model reinforces its own errors \cite{liSemiSupervisedSemanticSegmentation2023,rongBoundaryenhancedCotrainingWeakly2023}.

Furthermore, emerging paradigms reframe the problem by treating large-scale noisy labels not as a hindrance but as a valuable resource. Pretraining strategies use massive, imperfectly labelled datasets to learn robust feature representations that can be fine-tuned on smaller, clean datasets \cite{liuCromSSCrossmodalPretraining2025}. Other methods use meta-learning to bootstrap robust models, such as L2B by \citet{zhouL2BLearningBootstrap2024}, which learns to dynamically weight the influence of observed labels and model-generated pseudo-labels during training.

An alternative, more fundamental approach, and the one most relevant to our work, is to design inherently robust loss functions. Instead of relying on external modules for noise detection or correction, this strategy embeds noise tolerance directly into the optimization objective \cite{karimiDeepLearningNoisy2020}. Examples include the T-Loss from \citet{gonzalez-jimenezRobustTLossMedical2023}, which is based on the heavy-tailed Student-t distribution to reduce the influence of outliers, and the Active Negative Loss (ANL) framework proposed by \citet{yeActiveNegativeLoss2024}. Our work contributes to this line of research but takes a different route: rather than designing a new loss function from scratch, we propose a modular mechanism that enhances the inherent robustness of existing losses.

A promising strategy for mitigating the impact of label noise is to empower a model to abstain from making a prediction on samples it deems unreliable. This approach circumvents the core problem of standard supervised learning, where a model is forced to commit to a prediction, potentially leading it to memorize erroneous labels. The concept was formally introduced for deep learning in the Deep Abstaining Classifier (DAC) by \citet{thulasidasan2019combating}. The DAC framework enables abstention by augmenting a network's architecture with an additional $(k+1)$-th output neuron, which explicitly represents the choice to abstain. Its corresponding loss function is defined as:
\begin{equation}\label{eq:dac}
    \mathcal{L}_{DAC}(x_j)  = (1-p_{k+1})\left(-\sum^k_{i=1} t_i \log\frac{p_i}{1-p_{k+1}}\right) + \alpha\log\frac{1}{1-p_{k+1}}
\end{equation}
where $k$ is the number of classes, $t_i$ is the ground-truth label for class $i$, $p_i$ is the predicted probability of class $i$, and $p_{k+1}$ is the predicted probability of abstention. The loss is composed of two competing terms. The first is a modified Cross Entropy (CE) loss in which each class probability $p_i$ is re-normalized by $(1-p_{k+1})$, the probability of \textit{not} abstaining, and the whole term is scaled by that same factor.
The second is a regularization term that directly penalizes the act of abstaining \cite{thulasidasan2019combating}.
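Two limiting cases of Equation~(\ref{eq:dac}) make this trade-off explicit; they follow directly from the definition, assuming $\alpha > 0$ and that the re-normalized cross-entropy stays finite:
\[
    \mathcal{L}_{DAC}(x_j)\Big|_{p_{k+1}=0} = -\sum^k_{i=1} t_i \log p_i,
    \qquad
    \lim_{p_{k+1}\to 1}\, \alpha\log\frac{1}{1-p_{k+1}} = +\infty .
\]
Without abstention the loss thus reduces to the standard CE, whereas the prefactor $(1-p_{k+1})$ drives the first term to zero as $p_{k+1}\to 1$, so that only the diverging penalty prevents the trivial solution of abstaining on every sample.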

The abstention penalty is controlled by the hyperparameter $\alpha$. Crucially, DAC employs an adaptive auto-tuning schedule where, after an initial warm-up period, $\alpha$ is initialized to a small value and is linearly increased over the remaining epochs up to a predefined value, $\alpha_{final}$. This ramp-up strategy, detailed in \appendixref{app:alpha}, acts as a form of curriculum learning, initially permitting the model to ignore noisy samples and progressively forcing it to learn from more challenging data as its confidence grows \cite{thulasidasan2019combating}.
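One way to write such a linear ramp, as a sketch only, is
\[
    \alpha(e) = \alpha_{0} + (\alpha_{final}-\alpha_{0})\,\frac{e-E_{w}}{E-E_{w}}, \qquad E_{w}\le e\le E,
\]
where the epoch index $e$, the warm-up length $E_{w}$, the total number of epochs $E$, and the initial value $\alpha_{0}$ are notation assumed here for illustration rather than taken from the original formulation; the exact schedule is the one given in \appendixref{app:alpha}.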

Building directly upon this foundation, \citet{schneider2024informed} proposed the Informed Deep Abstaining Classifier (IDAC) to create a more targeted response to noise. IDAC refines the abstention mechanism by incorporating an a priori estimate of the dataset's noise level, $\tilde\eta$, directly into its regularization term. The IDAC loss function is defined as:
\begin{equation}\label{eq:idac}
    \mathcal{L}_{IDAC}(x_j)=(1-p_{k+1})\left(-\sum^k_{i=1} t_i\log\frac{p_i}{1-p_{k+1}}\right)+\alpha(\tilde\eta-\hat\eta)^2
\end{equation}
The key innovation lies in replacing DAC's logarithmic abstention penalty with a squared-error term that penalizes the gap between the expected noise rate $\tilde\eta$ and the model's current batch-wise abstention rate $\hat\eta$. This provides a more direct supervisory signal, guiding the model to abstain on a fraction of samples that is consistent with the estimated level of label corruption \cite{schneider2024informed}.
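As a purely illustrative calculation (the numbers are hypothetical and chosen only for exposition), take $\tilde\eta = 0.3$:
\[
    \hat\eta = 0.1 \;\Rightarrow\; \alpha(\tilde\eta-\hat\eta)^2 = 0.04\,\alpha, \qquad
    \hat\eta = 0.3 \;\Rightarrow\; \alpha(\tilde\eta-\hat\eta)^2 = 0, \qquad
    \hat\eta = 0.5 \;\Rightarrow\; \alpha(\tilde\eta-\hat\eta)^2 = 0.04\,\alpha .
\]
The regularizer in Equation~(\ref{eq:idac}) therefore penalizes abstaining too rarely and too often alike and vanishes when the abstention rate matches the noise estimate, whereas the penalty in Equation~(\ref{eq:dac}) grows monotonically with $p_{k+1}$.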

While DAC and IDAC have demonstrated the effectiveness of abstention, their application has been confined to the CE loss. Our work addresses this limitation by proposing a generalized abstention framework that serves as a modular tool for enhancing the robustness of a diverse range of loss functions.