

\clearpage

\section{Dataset Challenges}
\label{challenging}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/dataset_new.pdf}
\caption{Challenging ureteroscopy (a--f, left) and colonoscopy (g--h, right) images for segmentation.
(a) irrigation; (b) bleeding; (c) motion blur; (d) early ablation; (e) mid ablation; (f) late ablation. The arrow indicates the target kidney stone for ablation. (g) and (h) are from the public dataset \cite{ali2023multi}, which was collected from multiple imaging sites.}
\label{dataset_zoom}
\end{figure}


Fig.~\ref{dataset_zoom} provides representative challenging examples from both datasets. For the kidney stone ureteroscopy data (a--f), the in vivo surgical environment introduces strong image-quality degradation, including irrigation flow, bleeding, rapid camera motion, ablation-induced debris and bubbles, and specular highlights. Panels (d--f) illustrate a typical \textbf{localization-to-ablation} workflow: (d) the surgeon first \textbf{locates} the target stone (arrow), (e) the scope then moves closer and \textbf{ablation begins}, and (f) shows the view \textbf{after ablation}. During this transition, the stone may remain only partially visible and becomes increasingly affected by blood, motion, debris, and lighting changes as the scope approaches, resulting in ambiguous boundaries and occasional stone-absent frames even in stone-positive videos. For the PolypGen colonoscopy data (g--h), cross-center acquisition yields noticeable appearance shifts (color, texture, and illumination), and the sequence clips often contain motion blur and frequent polyp-absent frames, which effectively couples segmentation with an implicit detection challenge. Overall, the figure summarizes the main sources of difficulty that our method targets: cross-site appearance variation and severe, procedure-specific artifacts.








\section{Compared Methods}
\label{appendix:compared methods}
We compare our method with several state-of-the-art semi-supervised approaches \cite{wang2024allspark,sohn2020fixmatch,yang2023revisiting,tarvainen2017mean,luo2022semi-uncertainty,chen2021semi,luo2022semi-crossteaching}. 
These methods cover both single-network \cite{wang2024allspark,sohn2020fixmatch,yang2023revisiting,tarvainen2017mean} and cross-supervision \cite{chen2021semi,luo2022semi-crossteaching} frameworks, with and without transformer backbones \cite{luo2022semi-crossteaching,wang2024allspark}. 
They focus on different uncertainty modeling strategies, including aleatoric \cite{wang2024allspark,sohn2020fixmatch,yang2023revisiting} and epistemic \cite{yang2023revisiting,tarvainen2017mean,luo2022semi-uncertainty} uncertainty, and combine confidence-based pseudo-labeling \cite{sohn2020fixmatch,yang2023revisiting,chen2021semi,luo2022semi-crossteaching,wang2024allspark} with uncertainty-guided self-consistency \cite{tarvainen2017mean,luo2022semi-uncertainty}. 
For completeness, we summarize the main characteristics of each method below.


\begin{itemize}

\item \textbf{AllSpark} \cite{wang2024allspark}:
Single-network transformer-based semi-supervised semantic segmentation method built on a standard pseudo-labeling baseline. It inserts an AllSpark bottleneck between the encoder and decoder, where channel-wise cross-attention and a class-wise semantic memory reconstruct labeled features from unlabeled features to strengthen supervision.
It was published at \textit{CVPR} 2024.


\item \textbf{Uncertainty-Rectified Pyramid Consistency (URPC)} \cite{luo2022semi-uncertainty}:
It is a single-network pyramid-prediction framework for semi-supervised medical image segmentation.
The model produces multi-scale predictions and, for unlabeled data, enforces consistency between each scale and their average prediction.
Pixel-wise uncertainty is estimated from the discrepancy among scales in a single forward pass and is used both to weight the pyramid consistency loss and to impose an uncertainty-minimization regularizer, enabling more reliable use of unlabeled images. It was published at \textit{Medical Image Analysis} 2022.

\item \textbf{FixMatch} \cite{sohn2020fixmatch}:
Single-network method with a CNN backbone that combines consistency regularization and pseudo-labeling. 
For each unlabeled image, it takes the prediction on a weakly augmented view, keeps it as a qualified pseudo-label only if its confidence exceeds a fixed threshold, and trains the model to match this pseudo-label on a strongly augmented view of the same image. It was published at \textit{NeurIPS} 2020.


\item \textbf{UniMatch} \cite{yang2023revisiting}:
Single-network method with a CNN backbone that revisits FixMatch for semi-supervised semantic segmentation.
It maintains weak-strong consistency using fixed confidence-thresholded pseudo-labels from the weakly augmented image, and introduces unified perturbations that operate at both the image level (strong augmentations) and the feature level (dropout), together with two strongly augmented images guided by the same weak prediction, to better exploit the perturbation space. It was published at \textit{CVPR} 2023.


\item \textbf{Mean Teacher} \cite{tarvainen2017mean}: 
Teacher-student framework with a single convolutional neural network (CNN) backbone. The student is trained on labeled data, and an exponential moving average (EMA) of the student weights defines the teacher. For unlabeled data, a consistency loss enforces that the student prediction matches the teacher prediction under stochastic perturbations. This can be viewed as reducing epistemic uncertainty. It was published at \textit{NeurIPS} 2017.



\item \textbf{Cross Pseudo Supervision (CPS)} \cite{chen2021semi}:
Cross-supervision semi-supervised semantic segmentation framework in which two segmentation networks with the same architecture but different initializations are trained jointly. For both labeled and unlabeled images, the prediction from each network is used as a pseudo label to supervise the other, enforcing prediction consistency and effectively expanding the training data.
It was published at \textit{CVPR} 2021.




\item \textbf{Cross Teaching between CNN and Transformer (Cross Teaching)} \cite{luo2022semi-crossteaching}:
Cross-supervision semi-supervised segmentation framework that pairs a CNN (UNet) and a Transformer (Swin-UNet).
On unlabeled images, each network takes the prediction from the other network as a pseudo-label and is optimized with a cross-teaching Dice loss, providing implicit consistency while exploiting the complementary local and long-range representations of CNNs and transformers.
It was published at \textit{MIDL} 2022.


\end{itemize}
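
For reference, the core update rules of three of these baselines can be sketched as follows. The notation here is ours and serves only as a reminder; the original papers give the exact formulations. We write $\theta$ and $\theta'$ for the student and teacher weights, $\alpha\in[0,1)$ for the EMA decay, $\tau$ for the confidence threshold, $\mathcal{A}_w$ and $\mathcal{A}_s$ for weak and strong augmentations of an unlabeled image $u$, and $\ell_{\mathrm{ce}}(\mbox{target},\mbox{prediction})$ for the cross-entropy, applied per pixel in the segmentation setting.
\[
\theta'_{t}=\alpha\,\theta'_{t-1}+(1-\alpha)\,\theta_{t}
\quad\mbox{(Mean Teacher)},
\]
\[
\mathcal{L}_{u}=\mathbf{1}\big[\max_{c}p_{\theta}(c\mid\mathcal{A}_w(u))\geq\tau\big]\,
\ell_{\mathrm{ce}}\big(\hat{y},\,p_{\theta}(\cdot\mid\mathcal{A}_s(u))\big),
\quad
\hat{y}=\arg\max_{c}p_{\theta}(c\mid\mathcal{A}_w(u))
\quad\mbox{(FixMatch)},
\]
\[
\mathcal{L}_{\mathrm{cps}}=\ell_{\mathrm{ce}}\big(\hat{y}_{2},\,p_{1}\big)+\ell_{\mathrm{ce}}\big(\hat{y}_{1},\,p_{2}\big),
\quad
\hat{y}_{i}=\arg\max_{c}p_{i}(c)
\quad\mbox{(CPS)}.
\]
In the CPS expression, $p_{1}$ and $p_{2}$ are the predictions of the two networks, and each hard pseudo-label $\hat{y}_{i}$ is treated as a constant when supervising the other network.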


For the polyp segmentation task, we also compare with fully supervised methods that address complementary challenges: temporal modeling for video data (PNS+ \cite{ji2022video}), frequency-domain feature learning (DSHNet \cite{wang2025dynamic}), and efficient backbone design (EfficientNet \cite{tan2019efficientnet}).

\begin{itemize}


\item \textbf{PNS+} \cite{ji2022video}:
Video polyp segmentation method that models both long-term and short-term spatial-temporal dependencies via a global-to-local learning strategy. It employs a global encoder to extract anchor frame features and a local encoder to process consecutive frames within a sliding window, with two normalized self-attention (NS) blocks progressively refining spatial-temporal representations. The NS block uses channel splitting, query-dependent relevance measuring, and layer normalization to efficiently capture neighborhood correlations across frames. It was published at \textit{Machine Intelligence Research} 2022.


\item \textbf{DSHNet} \cite{wang2025dynamic}:
Dynamic spectrum-driven hierarchical learning network for polyp segmentation. It decomposes images into high-frequency and low-frequency components via the Discrete Cosine Transform. High-frequency features enhance boundary details through skip connections, while low-frequency features guide the generation of dynamic convolution kernels for region-level saliency modeling. The method divides images into polyp interior, boundary, and background regions, applying region-specific kernels to handle polyp heterogeneity. It was published at \textit{Medical Image Analysis} 2025.


\item \textbf{EfficientNet} \cite{tan2019efficientnet}:
Backbone family built on a compound scaling method that uniformly scales network depth, width, and resolution using a set of fixed scaling coefficients. Starting from a baseline network discovered via neural architecture search (NAS), it achieves superior accuracy-efficiency trade-offs compared to previous ConvNets. The compound scaling strategy enables more balanced resource allocation across different network dimensions than single-dimension scaling approaches. It was published at \textit{ICML} 2019.


\end{itemize}
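
As a brief reminder of the compound scaling rule mentioned above, the formulation in \cite{tan2019efficientnet} can be summarized as follows, with depth $d$, width $w$, and input resolution $r$ tied to a single compound coefficient $\phi$:
\[
d=\alpha^{\phi},\qquad w=\beta^{\phi},\qquad r=\gamma^{\phi},
\qquad\mbox{subject to}\quad \alpha\cdot\beta^{2}\cdot\gamma^{2}\approx 2,\quad \alpha,\beta,\gamma\geq 1 .
\]
Because the cost of a convolutional network grows roughly linearly in depth and quadratically in width and resolution, this constraint means that each unit increase of $\phi$ approximately doubles the FLOPs. For the baseline coefficients reported in that paper ($\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$), the product is $1.2\times1.1^{2}\times1.15^{2}\approx1.92$. Which EfficientNet variant serves as the backbone in our comparison is specified in the main text and is not implied by this sketch.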





\section{Dynamic Uncertainty Thresholding Analysis}
\label{uncetainty analysis}


Table~\ref{tab:sensitivity_combined} investigates the sensitivity of Endo-SemiS to the uncertainty-mask thresholding rule and the number of MC-dropout passes $K$ on PolypGen (10\% labeled). The two fixed thresholds capture complementary behaviors. The $\mu+\sigma$ rule is more conservative and prioritizes reliability by filtering high-uncertainty pixels, while the percentile rule $P_{95}$ is often more permissive and retains more pixels, which can increase pseudo-label coverage. Our method combines these two fixed choices via a dynamic threshold
\[
T=\min\!\big(\mu(U)+\sigma(U),\, P_{95}(U)\big),
\]
which automatically selects the stricter criterion per image. This inherits the stability of $\mu+\sigma$ when uncertainty is high and still preserves the coverage benefit of $P_{95}$ when the uncertainty distribution is well-behaved, yielding a better reliability--coverage balance for pseudo-labels. Consistent with this design, the combined rule achieves the best overall results in Table~\ref{tab:sensitivity_combined}, including the top single-frame Dice for Endo-SemiS-2 (79.4) and strong sequence performance. Increasing $K$ from 5 to 10 can further improve sequence Dice but not frame Dice, and larger $K$ does not provide consistent additional improvements, suggesting that a moderate $K$ is sufficient.
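
To make the rule concrete, the following is a minimal worked sketch. It assumes that a pixel $x$ is retained for pseudo-labeling when its uncertainty does not exceed $T$; the mask symbol $M$ and the numerical values are hypothetical and chosen only for illustration.
\[
M(x)=\mathbf{1}\big[\,U(x)\leq T\,\big],
\qquad
T=\min\!\big(\mu(U)+\sigma(U),\,P_{95}(U)\big).
\]
For an image with $\mu(U)=0.10$, $\sigma(U)=0.05$, and $P_{95}(U)=0.22$, the threshold is $T=\min(0.15,\,0.22)=0.15$, so the $\mu+\sigma$ criterion is active. For an image with a more dispersed uncertainty map, e.g. $\mu(U)=0.10$, $\sigma(U)=0.20$, and $P_{95}(U)=0.25$, the threshold is $T=\min(0.30,\,0.25)=0.25$, so the percentile criterion is active. In both cases $T\leq P_{95}(U)$, so under the retention assumption above the most uncertain (roughly 5\%) pixels of each image are always excluded.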

\input{tables/threshold_table}

% \section{Run time analysis}
% \xx{MC Dropout is used only during training to estimate uncertainty for pseudo-label weighting; it is not required for deployment and is disabled at test time.
% Therefore, inference uses a single deterministic forward pass (standard dropout-off evaluation), identical to a supervised baseline, with no additional latency from uncertainty estimation.}

% \begin{table}[t]
% \centering
% \small
% \caption{Deployment inference speed on an NVIDIA RTX A6000.}

% \label{tab:runtime}
% \begin{tabular}{lcc}
% \toprule
% Resolution & FPS $\uparrow$ & Latency (ms) $\downarrow$ \\
% \midrule
% $256\times256$ & 135.5 & 7.4 \\
% $512\times512$ & 72.6 & 13.8 \\
% \bottomrule
% \end{tabular}
% \end{table}

% \xx{Both settings exceed the typical 25-30 FPS requirement for real-time endoscopy.
% For reference, if the task requires to compute uncertainty via MC Dropout (or applies test-time augmentation) at inference, the cost increases because it requires multiple forward passes per image: MC Dropout with $T{=}5$ is approximately a $5\times$ slowdown, and $N$-view test-time augmentation is approximately an $N\times$ slowdown. These are optional analysis-time settings. Our reported deployment runtime uses a single deterministic forward pass (no MC Dropout, no test-time augmentation).}


\section{Failure Case Analysis}
\label{limitation}
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{figures/failure_case.pdf}
\caption{Failure cases.
Each row shows an endoscopic frame, the ground-truth mask, the predicted binary mask, the uncertainty map, and the probability map. For each case, the top and bottom panels correspond to Endo-SemiS-1 and Endo-SemiS-2, respectively. Color bars are normalized to $[0,1]$ (blue: low, red: high). In these cases, Endo-SemiS produces confidently wrong predictions.}
\label{limitation_figure}
\end{figure}

Fig.~\ref{limitation_figure} visualizes representative failure modes of our uncertainty-guided pseudo-labeling under severe intra-operative appearance shift (e.g., strong blur/out-of-focus, specular saturation, debris). In Case 1 (frame n) and Case 2 (frame n), the target is present in the ground truth but both networks predict all background. Importantly, the probability maps saturate toward background and the uncertainty (predictive entropy) remains near zero across the missed target region, with only weak increases near the predicted field-of-view boundary. Because our filtering rule relies on elevated uncertainty to reject unreliable pseudo-label pixels, these confidently wrong predictions are not filtered. Moreover, when both networks make the same confident error, the co-training signal provides no corrective disagreement, so cross-supervision cannot self-correct. Case 2 (frame m) further shows that uncertainty concentrates mainly at decision boundaries (higher entropy along edges) while remaining low in the interior, which limits its ability to flag globally unreliable predictions when the visual pattern is far from the training distribution.
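
The following short calculation illustrates why saturated probabilities evade the uncertainty filter. It is a sketch that assumes the uncertainty at a pixel is the binary entropy of the mean foreground probability $\bar{p}$ over the MC-dropout passes; the exact estimator is defined in the main text and may differ in normalization.
\[
H(\bar{p})=-\bar{p}\log\bar{p}-(1-\bar{p})\log(1-\bar{p}),
\qquad
0\leq H(\bar{p})\leq\log 2 .
\]
For a missed target pixel with $\bar{p}=0.01$, we obtain $H\approx0.056$ nats, i.e. about $8\%$ of the maximum $\log 2\approx0.693$ that is reached at $\bar{p}=0.5$. Such a pixel falls below any threshold that is set above this level and is therefore kept as a background pseudo-label, even though the ground truth is foreground.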



\section{Qualitative Results}
\begin{figure}[h]
\centering
\includegraphics[width=0.85\linewidth]{figures/qualitative_zoom.pdf}
\caption{Qualitative kidney stone results (10\% labeled data); a larger version of Fig.~\ref{qualitative} for better visualization. Yellow circles highlight poor-visibility areas. (a) fiberoptic frames, (b) digital frames, (c) fluid distortions, (d) motion blur, (e) debris during stone ablation, and (f) illumination changes.}
\label{qualitative_zoom}
\end{figure}
