



\section{Introduction}
Endoscopic image segmentation poses unique challenges, including large variations in image quality and appearance, which may be caused by motion blur, fluctuating lighting conditions \cite{li2025automated}, and often fluid-filled environments \cite{setia2023computer}, as well as domain shifts \cite{ali2023multi}. These effects are illustrated in Fig.~\ref{dataset}, which shows blur, bleeding, debris, occlusions, and cross-site or cross-device appearance changes in ureteroscopy and colonoscopy images. The limited availability of manual labels further complicates the task.


\begin{figure}[t]
\centering
\includegraphics[width=0.95\linewidth]{figures/dataset.pdf}
\caption{Challenging ureteroscopy (a--f, left) and colonoscopy (g--h, right) images for segmentation.
(a) irrigation; (b) bleeding; (c) motion blur;
(d) early ablation; (e) mid ablation;
(f) late ablation. Panels (g) and (h) are from a public dataset \cite{ali2023multi} collected at multiple imaging sites. Enlarged views are shown in Fig.~\ref{dataset_zoom}.}

\label{dataset}
\end{figure}



Semi-supervised learning (SSL) approaches provide a potential solution by effectively leveraging information from unlabeled data \cite{sohn2020fixmatch,chen2021semi,luo2022semi-crossteaching,luo2022semi-uncertainty,yang2023revisiting,tarvainen2017mean,wang2024allspark}. These methods construct supervision signals for unlabeled samples from the predictions of the model itself. A key approach to achieving this is enforcing consistency constraints \cite{tarvainen2017mean}, either through uncertainty-guided self-regularization \cite{sohn2020fixmatch,yang2023revisiting,luo2022semi-uncertainty,wang2024allspark,tarvainen2017mean} or cross-supervision \cite{chen2021semi,luo2022semi-crossteaching} to improve the quality and reliability of pseudo-labels.






Based on these principles, SSL methods can be broadly categorized into single-network and dual-network frameworks. Single-network approaches enforce consistency under perturbations and regularize pseudo-labels based on uncertainty \cite{sohn2020fixmatch,yang2023revisiting,wang2024allspark}. However, a single model tends to persist in its own incorrect predictions, leading to error accumulation. Dual-network approaches maintain two networks that exchange pseudo-labels
for cross-supervision \cite{chen2021semi,luo2022semi-crossteaching} to mitigate confirmation bias \cite{arazo2020pseudo}. Building on this, numerous studies in medical imaging have achieved strong segmentation performance \cite{luo2022semi-crossteaching,luo2022semi-uncertainty,wang2023ssl2,yu2019uncertainty,lei2022semi}.


Existing SSL methods nevertheless have three main limitations: \textbf{(1)} Single-network methods lack model-level consistency, so they struggle with high-uncertainty samples. \textbf{(2)} Methods that either use the entire uncertainty map or apply a fixed uncertainty threshold treat many unreliable regions as confident, leading to false positives and overfitting to incorrect pseudo-labels. \textbf{(3)} Cross-supervision methods do not explicitly model uncertainty and therefore struggle to filter out unreliable pseudo-labels. Since each model generates pseudo-labels independently, confirmation bias may still occur when both models make similar wrong predictions.


% \xx{\textbf{To the best of our knowledge, these limitations have not been systematically addressed for ureteroscopic kidney stone segmentation}, where severe variations in image quality and domain shift (Fig.~\ref{dataset}) make the task highly challenging and may lead to failure cases for existing methods.}



In this paper, we propose \textbf{Endo-SemiS}, a semi-supervised segmentation method that addresses these limitations and yields robust segmentation in endoscopic imaging. Specifically: \textbf{(1)} Endo-SemiS adopts a cross-supervision framework (see Fig.~\ref{framework}(a)) to prevent biased learning \cite{chen2022debiased} and uses standard U-Net models to ensure real-time clinical applicability \cite{wei2021shallow,luo2019real}, rather than relying on transformer-based models that may require heavy computation \cite{luo2022semi-crossteaching,wang2024allspark}. \textbf{(2)} To obtain reliable pseudo-labels for unlabeled data, a critical step in SSL \cite{wu2021semi}, we leverage both aleatoric and epistemic uncertainty (see Fig.~\ref{framework}(b)). Unlike existing fixed-threshold approaches \cite{sohn2020fixmatch}, we apply a dynamic threshold to each uncertainty map so that only high-confidence regions contribute to pseudo-label supervision. \textbf{(3)} To achieve accurate and consistent supervision, we introduce a joint pseudo-labeling strategy, shown in Fig.~\ref{framework}(c), in which supervision is guided by the predictions in the lowest-uncertainty regions identified by both networks, and pixels classified as uncertain are excluded. \textbf{(4)} We design multi-level mutual learning (see Fig.~\ref{framework}(d)) between the two networks to further mitigate confirmation bias and improve their consistency, producing more reliable pseudo-labels. Our main contributions are:

\begin{itemize}
    \item We propose an uncertainty-guided pseudo-labeling approach within a cross-supervision framework, which dynamically filters out unreliable regions for each image and provides more reliable segmentation supervision from unlabeled endoscopic frames.
    \item We introduce a consistency-focused learning framework with joint pseudo-label supervision and multi-level mutual learning. The more reliable prediction between the two networks is selected as supervision, while mutual learning reduces unnecessary prediction variance in confident regions and leads to more stable pseudo-labels.
    \item We design a plug-and-play correction model that uses spatiotemporal information from video to refine segmentation and can be easily integrated into other frameworks.
\end{itemize}



We validate Endo-SemiS on kidney stone laser lithotripsy as a challenging primary task and on polyp screening across different centers to demonstrate generalizability. Our comprehensive evaluation shows consistent improvements over state-of-the-art methods.
