




\section{Experiments}

\paragraph{{Kidney stone dataset.}} This in-house dataset \cite{deol2024mp07} consists of 38 fiberoptic and 98 digital endoscopy videos. We extracted frames at 3 FPS, resulting in a total of 21,718 labeled frames. We partitioned the data at the video level, yielding an approximately 75/5/20\% split for training/validation/testing. While all videos contain kidney stones, some individual frames may not, which introduces an implicit detection challenge in addition to segmentation. The dataset exhibits substantial variation in image quality due to the complex in vivo surgical environment, including rapid motion, debris, and fluctuating lighting conditions (Fig.~\ref{dataset}). All images are resized to $256\times256$. Detailed information about the problem setting and key challenges is provided in Appendix~\ref{challenging}.


\paragraph{{Polyp colonoscopy dataset.}}
PolypGen~\cite{ali2023multi} is a public colonoscopy dataset collected from six imaging centers. It contains 1,537 single-labeled frames, which are discretely sampled and focus on polyp-present images, and 2,225 sequence-labeled frames sampled from short video clips, which may include both polyp-present and polyp-absent
views. The sequence setting is more challenging due to larger appearance
variation, motion blur, and frequent polyp-absent frames. Following the benchmark \cite{ali2023multi}, we train on frame data from centers 1--5 and evaluate on center 6 for both frame and sequence data.
All images are resized to $512\times512$. Detailed information is provided in Appendix~\ref{challenging}.

% \paragraph{{Polyp colonoscopy dataset.}} PolypGen \cite{ali2023multi} is a publicly available multi-center dataset with 1,537 single-labeled frames (discrete sampling) and 2,225 sequence-labeled frames (short clips) collected from six different imaging centers. Following the benchmark study \cite{ali2023multi}, we use data from centers 1–5 for training and test on center 6. We resize images to $512\times 512$.

\paragraph{Implementation details.} During training, we use the standard binary cross-entropy loss for both $L_s$ and $L_p$, and train for 200 epochs with a batch size of 16. The initial learning rate is $10^{-4}$ and decays to $10^{-5}$ following a cosine schedule. All experiments were conducted on an NVIDIA A6000 GPU.
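
As a reference for the learning-rate decay, a standard cosine annealing rule is sketched below. The exact variant is an assumption: only the start and end learning rates are stated, so the per-epoch indexing, the absence of warm-up, and the symbols $\eta_t$, $t$, and $T$ are introduced here purely for illustration.
\[
\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max}-\eta_{\min}\right)\left(1+\cos\frac{\pi t}{T}\right), \qquad t = 0,1,\dots,T,
\]
with $\eta_{\max}=10^{-4}$, $\eta_{\min}=10^{-5}$, and $T=200$ epochs. Under this form, $\eta_0=10^{-4}$, $\eta_{T/2}=5.5\times10^{-5}$, and $\eta_T=10^{-5}$.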
% We will release the code along with logs for reproducing at \url{https://github.com/MedICL-VU/Endo-SemiS}.
% \noindent \textbf{Compared methods.} We compared to several state-of-the-art semi-supervised \cite{wang2024allspark,sohn2020fixmatch,yang2023revisiting,tarvainen2017mean,luo2022semi-uncertainty,chen2021semi,luo2022semi-crossteaching} methods, including single-network  \cite{wang2024allspark,sohn2020fixmatch,yang2023revisiting,tarvainen2017mean} and cross-supervision \cite{chen2021semi,luo2022semi-crossteaching} methods, including some incorporating transformers \cite{luo2022semi-crossteaching,wang2024allspark}. These methods focus on different uncertainty modeling, such as aleatoric uncertainty \cite{wang2024allspark,sohn2020fixmatch,yang2023revisiting} and epistemic uncertainty \cite{yang2023revisiting,tarvainen2017mean,luo2022semi-uncertainty}. Furthermore, they employ pseudo-labeling \cite{sohn2020fixmatch,yang2023revisiting,chen2021semi,luo2022semi-crossteaching,wang2024allspark} and uncertainty-guided self-consistency \cite{tarvainen2017mean,luo2022semi-uncertainty} to improve learning stability and reliability. 


\paragraph{Compared methods.}
We compare to several state-of-the-art semi-supervised learning methods, including 
Generic \cite{bellver2019budget}, AllSpark \cite{wang2024allspark},
UPRC \cite{luo2022semi-uncertainty},
FixMatch \cite{sohn2020fixmatch}, UniMatch \cite{yang2023revisiting}, Mean Teacher \cite{tarvainen2017mean},
Cross-Pseudo Supervision (CPS) \cite{chen2021semi} and Cross Teaching \cite{luo2022semi-crossteaching}. For the polyp dataset, we additionally compare against state-of-the-art polyp segmentation methods (PNS+~\cite{ji2022video} and DSHNet~\cite{wang2025dynamic}) as well as a lightweight CNN (EfficientNet \cite{tan2019efficientnet}). Further details are provided in Appendix~\ref{appendix:compared methods}.

% These methods can be categorized into single-network (Generic, AllSpark, UPRC, FixMatch, UniMatch, Mean Teacher) and cross-supervision (CPS and Cross Teaching) methods, and some of these approaches  incorporate transformer-based architectures, such as Cross Teaching, AllSpark. These methods explore different forms of uncertainty modeling, including aleatoric uncertainty (AllSpark, FixMatch, UniMatch) and epistemic uncertainty (UniMatch, MeanTeacher, UPRC). Most approaches rely on pseudo-labeling (FixMatch, UniMatch, CPS, CrossTeaching, AllSpark) and uncertainty-guided self-consistency mechanisms (MeanTeacher, UPRC) to improve learning stability and reliability. We implemented these methods with their official code repositories. Further details on the category classification of the compared methods are provided in Appendix~\ref{appendix:compared methods}.


\paragraph{Evaluation metrics.}
We report pixel-level segmentation performance using Dice, sensitivity, and specificity. We also evaluate image-level target presence detection by converting each predicted mask into a binary image label. An image is predicted positive if any foreground pixel is present and negative otherwise. The precision, recall, F1-score, and accuracy are computed at the image level. These metrics indicate whether the model detects the presence or absence of the target object, independent of pixel-wise overlap quality. 
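
For concreteness, these metrics can be written in terms of confusion counts. This is a sketch using the standard definitions; the symbols are introduced here for illustration, and details such as the handling of empty masks or whether scores are averaged per image are not specified in this section. At the pixel level,
\[
\mathrm{Dice}=\frac{2\,TP}{2\,TP+FP+FN},\qquad
\mathrm{Sensitivity}=\frac{TP}{TP+FN},\qquad
\mathrm{Specificity}=\frac{TN}{TN+FP},
\]
where $TP$, $FP$, $TN$, and $FN$ count pixels. For the image-level task, let $\hat{m}_i\in\{0,1\}$ denote the binarized prediction at pixel $i$. The predicted image label is
\[
\hat{y}=\mathbf{1}\!\left[\sum_i \hat{m}_i>0\right],
\]
and, assuming the reference label $y$ is derived from the ground-truth mask in the same way, the image-level counts $TP_{\mathrm{img}}$, $FP_{\mathrm{img}}$, $TN_{\mathrm{img}}$, and $FN_{\mathrm{img}}$ over $N$ test images give
\[
P=\frac{TP_{\mathrm{img}}}{TP_{\mathrm{img}}+FP_{\mathrm{img}}},\qquad
R=\frac{TP_{\mathrm{img}}}{TP_{\mathrm{img}}+FN_{\mathrm{img}}},\qquad
F_1=\frac{2PR}{P+R},\qquad
\mathrm{Acc}=\frac{TP_{\mathrm{img}}+TN_{\mathrm{img}}}{N}.
\]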

%\noindent\textbf{Evaluation metrics.} 


%A detailed categorization of the compared methods is provided in Appendix Table~\ref{tab:app:method_summary}.


% \xx{Predicted masks are binarized using a fixed threshold of 0.5 on the probability map. An image is considered positive if any foreground is present. This reflects the ability of the model to detect the presence or absence of target objects at the image level, independent of pixel-wise accuracy.}




\input{tables/table1}


%\noindent \textbf
\paragraph{Segmentation performance.} The quantitative results on the kidney stone dataset using 10\% labeled data are shown in Tab.~\ref{main_table}. The Generic model underperforms supervised learning, which highlights the critical role of pseudo-label quality in semi-supervised segmentation.
% This is also partially supported by the UPRC results, which show that self-guided uncertainty struggles when the network fails to generate reliable pseudo-labels independently.
In contrast, the results of Mean Teacher, UniMatch, and FixMatch show that incorporating external uncertainty improves segmentation, especially for UniMatch, where epistemic uncertainty is also leveraged. The results of AllSpark indicate that this transformer-based method struggles with kidney stone segmentation, where image quality is variable (Fig.~\ref{qualitative}; enlarged views in Fig.~\ref{qualitative_zoom}). Cross-supervision methods (lavender) achieve better performance than single-network methods (blue), demonstrating better generalizability.
% Although our method achieves accurate segmentation, as evidenced by sensitivity, it comes with a trade-off of increased false positives, as reflected in the specificity and precision results. However, 
Endo-SemiS achieves substantially better performance on most metrics than these SOTA semi-supervised methods. Notably, it even outperforms supervised methods trained on the fully labeled data (upper bound, green).


\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/qualitative.pdf}
\caption{Qualitative kidney stone results (10\% labeled data). Yellow circles highlight poor visibility areas. 
Enlarged, high-resolution views are shown in Fig.~\ref{qualitative_zoom}.
}
\label{qualitative}
\end{figure}




\paragraph{Consistency analysis.} In Tab.~\ref{ratio_table}, we present consistency results from two aspects: (1) robustness across different ratios of labeled training data, and (2) consistency between models within the framework. Endo-SemiS maintains stable performance across different ratios and remains particularly robust when labeled data is extremely limited (only 1\%). The performance of the two cross-supervised models in our framework is more consistent and reliable than that of the compared methods.
% This is important for clinical applications, where selecting between models can be challenging, and model ensembling may even degrade performance, as shown by the results of Cross-Teaching. Additionally, deploying an extra model in real-time applications such as kidney stone lithotripsy is impractical, as the network must provide precise segmentation in real-time to assist surgeons effectively.
Considering the challenging visibility conditions in kidney stone surgery (Fig.~\ref{qualitative}), consistency is crucial to performance because inaccurate pseudo-labels can severely degrade segmentation results.
Finally, we observe that our ST corrective model improves performance across all label ratios.




\input{tables/table3}
\input{tables/table4}


\paragraph{Ablation analysis.} Tab.~\ref{ablation_table} shows the ablation study, where CPS is used as the baseline method, and the improvements for each added component are shown. Importantly, joint pseudo-label supervision (JPS) yields a larger improvement, which indicates that it effectively removes uncertain regions and generates high-quality pseudo-labels for supervision, especially for strongly augmented images.
Although multi-level mutual learning slightly decreases performance, it improves consistency.



\input{tables/kidney_size_table}


\paragraph{Kidney stone size analysis.}
We stratify the test set (n=3959) by kidney stone size using the ground truth mask area relative to the image area. Tab.~\ref{stone_size_table} shows that Endo-SemiS achieves the best semi-supervised Dice in the Small/Medium/Large groups and the best overall Dice. Among SSL methods, the largest improvement is for large stones. In intra-operative settings, close-range views are common and often involve ongoing ablation, which increases boundary ambiguity and makes segmentation more challenging (Fig.~\ref{dataset_zoom}). Compared to cross-supervised methods, our method still shows a clear improvement on small stones, suggesting better robustness on challenging small-region segmentation, where limited pixel support makes predictions sensitive to noise. Overall, these results show that the proposed uncertainty-guided learning remains effective across stone sizes. However, the ``No Stone'' group Dice suggests occasional false positives.
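
The stratification criterion can be sketched as follows. The thresholds $\tau_1<\tau_2$ are hypothetical placeholders, since the actual cut-offs are not stated in this section, and the symbols $r$ and $m$ are introduced here for illustration. For a ground-truth mask $m\in\{0,1\}^{H\times W}$, the relative stone area and the resulting group are
\[
r=\frac{1}{HW}\sum_{i=1}^{HW} m_i,\qquad
\mathrm{group}(r)=\left\{
\begin{array}{ll}
\mbox{No Stone}, & r=0,\\
\mbox{Small}, & 0<r\le\tau_1,\\
\mbox{Medium}, & \tau_1<r\le\tau_2,\\
\mbox{Large}, & r>\tau_2.
\end{array}\right.
\]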



\input{tables/table2_rebuttal}


\paragraph{Generalizability analysis.}
We further evaluate Endo-SemiS on the PolypGen cross-center setting (Tab.~\ref{polyp_table_rebuttal}) to explicitly assess robustness to domain shift.

With only 10\% labeled data, key findings are:
(1) Supervised training can perform well on single-frame evaluation but often degrades noticeably on sequence evaluation in the cross-center setting, indicating limited robustness when models are trained on frames but tested on sequences under domain shift with substantial appearance changes.
(2) Using the same U-Net backbone, Endo-SemiS consistently improves over supervised training and SSL methods on both single-frame and sequence evaluation, demonstrating stronger generalization despite these appearance differences across domains. A detailed uncertainty analysis is provided in Appendix~\ref{uncetainty analysis}.
(3) While a stronger backbone improves single-frame performance, sequence predictions can become less stable under domain shift. Endo-SemiS alleviates this issue with more reliable sequence predictions.
(4) Endo-SemiS is model-agnostic and can benefit from a stronger backbone without sacrificing its single-frame advantages, while mitigating its weaknesses on sequence evaluation by producing more stable predictions.
(5) A lightweight CNN (EfficientNet) shows limited performance under supervised training with limited labels. Yet when trained with Endo-SemiS, it achieves sequence performance comparable to heavier backbones while exhibiting the smallest gap between single-frame and sequence evaluation. This further validates the model-agnostic property of our method.


% \section{Conclusion} In this study, we propose \textbf{Endo-SemiS} for robust endoscopic segmentation using semi-supervised learning. It uses an uncertainty-guided pseudo-label strategy, cross- and joint-supervision, and multi-level mutual learning. We demonstrate the state-of-the-art performance of Endo-SemiS in experiments on two endoscopy datasets with varying image quality. The proposed spatiotemporal corrective model can further improve the segmentation performance.


%\section{Conclusion}
% In this study, we propose \textbf{Endo-SemiS} for robust endoscopic segmentation via semi-supervised learning under limited annotation.   Endo-SemiS extends cross-supervision by integrating uncertainty-guided pseudo-label generation, joint pseudo-label supervision, and multi-level mutual learning to improve training stability and pseudo-label reliability.  We evaluate Endo-SemiS on two clinical endoscopy applications, kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy, using two datasets with challenging image quality.  Compared to state-of-the-art methods, Endo-SemiS achieves superior segmentation performance, indicating improved robustness and generalization under challenging endoscopic conditions. In addition, a spatiotemporal corrective network further improves performance by leveraging inter-frame information.  Future work will apply Endo-SemiS to additional endoscopic procedures and broader domain shifts, \xx{as well as larger dataset}, and will further incorporate temporal information into the semi-supervised learning framework.  
%\section{Conclusion}
% In this study, we propose \textbf{Endo-SemiS} for robust endoscopic segmentation using semi-supervised learning. It uses an uncertainty-guided pseudo-label strategy, cross- and joint-supervision, and mutual learning. We demonstrate the state-of-the-art performance of
% Endo-SemiS in experiments on two endoscopy datasets with varying image quality. 
% % The
% % proposed spatiotemporal corrective model can further improve the segmentation performance.
% Future work will apply Endo-SemiS to additional dataset, and will further incorporate temporal information into the semi-supervised learning framework.  



\section{Conclusion}

In this study, we propose \textbf{Endo-SemiS} for robust endoscopic segmentation using semi-supervised learning. It uses an uncertainty-guided pseudo-label strategy, cross- and joint-supervision, and mutual learning, and achieves strong performance on two endoscopy datasets with substantial variations in image quality. Endo-SemiS can still fail when the model produces confidently wrong predictions that pass the uncertainty filter (Appendix~\ref{limitation}), and we observe residual false positives in stone-absent frames (Tab.~\ref{stone_size_table}), reflecting the coupled detection-segmentation challenge posed by stone-positive videos that still contain stone-absent frames. Future work will validate Endo-SemiS on additional datasets (e.g., SunSeg \cite{ji2022video}) and further incorporate temporal information into the semi-supervised learning framework.




