\section{Experimental Setup}\label{sec:exp}
We validated our framework on two publicly available surgical datasets with distinct characteristics to assess how well the approach generalizes.
\begin{itemize}
    \item Cataract Dataset for Image Segmentation (CaDIS): A benchmark featuring 4,670 frames from cataract surgery with dense, high-quality pixel-wise annotations \cite{grammatikopoulou2021cadis}. For our experiments, we utilized the 8-class variant, which groups surgical tools into a single class. Images were normalized and resized to $480\times256$.
    \item Dresden Surgical Anatomy Dataset (DSAD): A more complex benchmark with 1,430 frames from laparoscopic surgery \cite{carstens2023dresden}. This dataset is more challenging because of its intricate anatomical structures, sparse annotations (approximately 82\% background), and significant class imbalance. Images were normalized and resized to $480\times384$.
\end{itemize}

To rigorously assess noise robustness, we simulated realistic annotation errors with a two-pronged approach. Structural noise was introduced via morphological transformations (erosion and dilation) to simulate boundary inaccuracies, while semantic noise was injected via stochastic label flipping to mimic annotator bias \cite{karimiDeepLearningNoisy2020, zhangDisentanglingHumanError2020, marcinkiewiczQuantitativeImpactLabel2019, liSemiSupervisedSemanticSegmentation2023}. 
Visualizations demonstrating the realism and severity of these structural and semantic corruptions, including difference maps between ground truth and noisy annotations, are provided in \appendixref{app:noise}.
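
To make the two corruption types concrete, the following Python sketch applies structural noise (binary erosion or dilation of one class) and semantic noise (random relabeling) to an integer label mask. It is a minimal illustration under stated assumptions and not necessarily the procedure used in the experiments: the function names, the $3\times3$ square structuring element, the background index 0, and the per-class flipping rule are hypothetical choices made for this example.

```python
import numpy as np


def dilate(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary dilation with a 3x3 square structuring element (assumed)."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        # Pad with False so nothing grows in from outside the image.
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def erode(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary erosion, computed as the complement of the dilated complement.

    With this construction, pixels outside the image count as foreground,
    so a region touching the image border is not eroded from that side.
    """
    return ~dilate(~mask.astype(bool), iterations)


def structural_noise(labels, class_id, mode, iterations=1, background=0):
    """Shrink or grow the region of one class to mimic boundary errors."""
    mask = labels == class_id
    noisy = labels.copy()
    if mode == "dilate":
        # Pixels newly covered by the grown region take the class label.
        noisy[dilate(mask, iterations)] = class_id
    elif mode == "erode":
        # Pixels stripped from the region fall back to background (assumed).
        noisy[mask & ~erode(mask, iterations)] = background
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return noisy


def semantic_noise(labels, flip_prob, num_classes, rng):
    """Relabel each class present in the mask with probability flip_prob.

    Flipping whole classes per image is an assumption of this sketch;
    a pixel-wise or region-wise rule would be equally plausible.
    """
    noisy = labels.copy()
    for c in np.unique(labels):
        if rng.random() < flip_prob:
            choices = [k for k in range(num_classes) if k != c]
            noisy[labels == c] = rng.choice(choices)
    return noisy


# Usage on a toy mask: one 4x4 square of class 1 on background 0.
toy = np.zeros((8, 8), dtype=np.int64)
toy[2:6, 2:6] = 1
toy_dilated = structural_noise(toy, class_id=1, mode="dilate")
toy_flipped = semantic_noise(toy, 0.5, 3, np.random.default_rng(0))
```

In this sketch, the number of morphological iterations and the flipping probability are the knobs that control how severe the corruption is.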

We evaluated performance at five calibrated noise levels for each dataset: 5--25\% corruption for CaDIS and 3--15\% for DSAD. We define the noise rate $\eta$ as the global fraction of pixels in the dataset, reported as a percentage, at which the noisy annotation mask differs from the ground truth mask, i.e., $\frac{1}{N}\sum_{i=1}^{N} \mathds{1}(\hat{y}_i \neq y_i) \approx \eta$, where $N$ is the total number of pixels, $y_i$ is the ground truth label of pixel $i$, and $\hat{y}_i$ is its noisy label.
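
The noise rate defined above can be measured directly from pairs of clean and corrupted masks. The following Python sketch shows one way to do so; the function name \texttt{noise\_rate} and the representation of the dataset as two parallel lists of integer arrays are assumptions made for illustration.

```python
import numpy as np


def noise_rate(clean_masks, noisy_masks):
    """Global fraction of pixels whose noisy label differs from the clean one.

    Counts are pooled over the whole dataset before dividing, so images
    of different sizes contribute in proportion to their pixel count.
    """
    differing = 0
    total = 0
    for y, y_hat in zip(clean_masks, noisy_masks):
        y = np.asarray(y)
        y_hat = np.asarray(y_hat)
        if y.shape != y_hat.shape:
            raise ValueError("clean and noisy masks must have the same shape")
        differing += int(np.count_nonzero(y != y_hat))
        total += y.size
    return differing / total


# Usage: two 2x2 masks, one pixel corrupted in the first mask only.
clean = [np.array([[0, 1], [1, 1]]), np.array([[2, 2], [0, 0]])]
noisy = [np.array([[0, 1], [1, 0]]), np.array([[2, 2], [0, 0]])]
eta = noise_rate(clean, noisy)  # 1 differing pixel out of 8
```

Pooling the counts over the dataset, rather than averaging per-image rates, matches the global definition of $\eta$ given above.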

We trained a U-Net model \cite{ronneberger2015u} with a pretrained ResNet-50 \cite{he2016deep} backbone on an NVIDIA A100 80GB GPU. Key training hyperparameters are detailed in \appendixref{app:train_params}. To ensure statistical reliability, each experiment was repeated five times with distinct random seeds. Hyperparameters for each loss function were tuned to maximize validation mean Intersection over Union (mIoU) at the highest noise level of each dataset, thereby maximizing noise resistance. Crucially, to ensure a fair comparison and isolate the impact of our abstention framework, the optimal hyperparameters found for the baseline GCE and SCE functions were reused unchanged for their respective abstaining versions, GAC and SAC. Although these settings are likely suboptimal for the abstaining losses, this protocol ensures that any observed performance gains can be attributed to the abstention mechanism rather than to additional tuning. \appendixref{app:loss_params} details the hyperparameters used for each loss in our benchmarks.
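
For reference, the selection metric and the aggregation over seeds can be sketched in Python as follows. This is an illustrative sketch rather than the evaluation code used in the experiments: the function names are hypothetical, classes absent from both prediction and ground truth are skipped when averaging, and the spread over seeds is reported as a sample standard deviation, all of which are assumptions of the example.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Pixel-level confusion matrix; rows are true classes, columns predictions."""
    t = np.asarray(y_true).ravel().astype(np.int64)
    p = np.asarray(y_pred).ravel().astype(np.int64)
    idx = num_classes * t + p
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean Intersection over Union across classes that actually occur."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0  # skip classes absent from both masks (assumed)
    return float((tp[present] / union[present]).mean())


def summarize_seeds(scores):
    """Mean and sample standard deviation (ddof=1) over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))


# Usage: a tiny two-class example and a two-run summary.
conf = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
miou = mean_iou(conf)
mean_score, std_score = summarize_seeds([0.5, 0.7])
```

Accumulating one confusion matrix over the whole validation set before computing IoU gives a dataset-level score; averaging per-image IoU values instead would weight small images and rare classes differently.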