

% \paragraph{sharp}
% \paragraph{real-time}
% \paragraph{range?}



\section{Results}
\paragraph{Phantom colonoscopy depth dataset (C3VD).}
\label{C3VD dataset}
This dataset \cite{bobrow2023} is a widely used colonoscopy depth benchmark. It contains 22 video sequences that are registered to generate 10,015 frames with paired ground-truth depth maps. The depth values represent distance along the $z$-axis of the camera frame and are clamped to the range 0--100~$\mathrm{mm}$. The dataset includes four colon segments: cecum, descending colon, sigmoid colon, and transverse colon. We define two evaluation splits to assess both overall performance and generalization. \underline{The first} is a domain-shift split, where videos from the transverse colon are held out for testing to evaluate cross-segment generalization. \underline{The second} is an in-distribution split, where training and test videos are drawn from all four colon segments, following prior work \cite{paruchuri2024leveraging}. Detailed split definitions are provided in Appendix~\ref{Split}. This phantom dataset is our primary benchmark and is used for method development.


\paragraph{Simulated colonoscopy depth dataset (SimCol3D).}
This dataset consists of 33 videos with 37,800 frames and paired depth maps \cite{rau2024simcol3d}. It was used in a MICCAI challenge, and we follow the official training and evaluation splits. In our study, SimCol3D is used only for evaluation, not for method development.

\paragraph{Implementation details.} During training, we resize all images to $518 \times 518$ (C3VD) and $476 \times 476$ (SimCol3D), and use the AdamW optimizer with learning rates of $5\times10^{-6}$ for the encoder and $5\times10^{-5}$ for the decoder. We train for 15K iterations with a batch size of 4 and a temporal window of 5 frames. All experiments are conducted on an NVIDIA A6000 GPU.


\input{tables/main_table}


\paragraph{Compared methods.}
%\noindent\ul{\textbf{Compared methods.}}
We compare against several state-of-the-art depth estimation methods: DepthAnything v2 \cite{yang2024depthv2}, a foundation depth model from the natural-image domain, and EndoOmni \cite{tian2024endoomni}, which targets medical endoscopic images. We also compare against the recent video depth estimation method FlashDepth~\cite{chou2025flashdepth}. Motivated by the strong performance of DINOv3 on monocular depth estimation~\cite{simeoni2025dinov3}, we further adapt DINOv3 to medical depth estimation as an additional baseline. All competing methods are implemented using their official code repositories.

For C3VD benchmarking, we compare against PPSNet \cite{paruchuri2024leveraging}, which estimates depth using near-field illumination modeling and a teacher--student framework. In addition, we report the top-performing methods from the official SimCol3D challenge \cite{rau2024simcol3d} on the unseen test sequence (SimCol III) for comparison.

\paragraph{Evaluation metrics.} We report standard depth metrics, including absolute relative error (AbsRel), squared relative error (SqRel), root-mean-squared error (RMSE), RMSE in log space (RMSE log), mean absolute error (L1), and the accuracy under threshold ($\delta_1$, with threshold $1.25$). To assess spatial detail, we compute the boundary F1 score between predicted and ground-truth depth edges. We additionally report a frame variance metric ($\sigma$) that quantifies the temporal stability of depth predictions across frames (details in Appendix~\ref{Evaluation metric}).
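
For reference, the pixel-wise metrics above are conventionally defined as follows. The notation is an assumption introduced here for illustration: $d_i$ and $\hat{d}_i$ denote the ground-truth and predicted depth at valid pixel $i$, and $N$ is the number of valid pixels. Masking and averaging conventions may differ between implementations.
\[
\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{d}_i - d_i|}{d_i},
\qquad
\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{d}_i - d_i)^2}{d_i},
\qquad
\mathrm{L1} = \frac{1}{N}\sum_{i=1}^{N}|\hat{d}_i - d_i|,
\]
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{d}_i - d_i)^2},
\qquad
\mathrm{RMSE\ log} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\log\hat{d}_i - \log d_i)^2},
\]
\[
\delta_1 = \frac{1}{N}\,\Bigl|\Bigl\{\, i : \max\Bigl(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}\Bigr) < 1.25 \Bigr\}\Bigr| .
\]
Lower is better for all metrics except $\delta_1$, which is the fraction of pixels whose prediction is within a factor of $1.25$ of the ground truth.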












%\subsection{Analysis for EndoStreamDepth}

%\paragraph{Overall performance.}
\paragraph{Overall performance.} The quantitative results on the C3VD dataset (\underline{split 1}) are shown in Tab.~\ref{main_table} (top section). DepthAnything v2 and its metric version \cite{yang2024depthv2} outperform the other state-of-the-art foundation models in depth prediction, on both relative metrics ($\delta_1$ and AbsRel) and distance metrics (RMSE and L1). However, FlashDepth produces sharper depth maps (higher F1 score) than the other compared methods. The proposed method, EndoStreamDepth, substantially outperforms all compared methods in global geometry, distances, and anatomical edges. Qualitative results are shown in Fig.~\ref{qualitative}, where our method exhibits small errors in both the near and far range and sharp depth boundaries, as evidenced by the edge maps (see also Fig.~\ref{sharp depth map}).



\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{figures/qualitative.pdf}
\caption{
Qualitative results. From top to bottom: AbsRel error maps, predicted depth maps, and edge maps derived from the predicted depth maps, cropped to the dashed line for visualization. Yellow and green arrows indicate far- and near-range errors, respectively. Red arrows highlight defects in the edge maps.
}
\label{qualitative}
\end{figure}









\paragraph{Temporal consistency and runtime.}
In Fig.~\ref{barplot}, we compare the per-video temporal variance $\sigma$ and inference speed in FPS on C3VD (\underline{split 1}) for our method and FlashDepth. Our method achieves a smaller $\sigma$ than FlashDepth on 8 of 9 videos, indicating that the proposed multi-level temporal modules and temporal regularization effectively suppress frame-to-frame variation. Additional evidence is provided in the ablation study visualization (Fig.~\ref{temporal_results}). As a trade-off, our method runs at a lower frame rate than FlashDepth (24 vs. 36 FPS on average across all videos). However, it maintains real-time throughput ($>20$ FPS), which is adequate for medical robot deployment. Runtime details are provided in Appendix~\ref{runtime}.
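
To put these frame rates in terms of per-frame latency, the following back-of-the-envelope conversion may help. It assumes that FPS is the reciprocal of the mean per-frame processing time $t$, with no batching or pipelining; this assumption is introduced here for illustration and is not a measured latency.
\[
t = \frac{1000}{\mathrm{FPS}}\ \mathrm{ms},
\qquad
t_{24} = \frac{1000}{24} \approx 41.7\ \mathrm{ms},
\qquad
t_{36} = \frac{1000}{36} \approx 27.8\ \mathrm{ms},
\qquad
t_{20} = \frac{1000}{20} = 50\ \mathrm{ms}.
\]
Under this assumption, the lower frame rate corresponds to roughly $14$~ms of additional processing time per frame, while remaining below the $50$~ms budget implied by a 20 FPS threshold.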




\begin{figure}[t]
\centering
\includegraphics[width=0.83\linewidth]{figures/std_step_fps_4bars_per_video_custom_colors.png}
 \caption{Per-video temporal variance and runtime on the C3VD dataset. For each sequence, bars show the frame-variance score $\sigma$ (left axis) and the inference speed in FPS (right axis) for our method and FlashDepth. Our model has smaller variance than FlashDepth on most sequences, at the cost of lower FPS.}
\label{barplot}
\end{figure}




\paragraph{Ablation study.} We select metric DepthAnything v2 and FlashDepth as the ablation baselines for the single-frame and video-based depth networks, respectively. The ablation results are reported cumulatively (Tab.~\ref{main_table}). We observe that replacing the $\ell_1$ loss with SiLog improves FlashDepth performance. We also find that our simple yet effective Endoscopy-Specific Transformation (EST) substantially improves these foundation models, although this has not been widely discussed in other endoscopy depth estimation work. A foundation model with EST can therefore serve as a strong baseline.
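
For readers unfamiliar with SiLog, one commonly used form of the scale-invariant logarithmic loss is sketched below. The symbols are assumptions introduced here for illustration: $g_i = \log\hat{d}_i - \log d_i$ is the log-depth residual at valid pixel $i$, $N$ is the number of valid pixels, and $\lambda \in [0,1]$ and $\alpha > 0$ are hyperparameters whose values are not specified in this section. Other variants omit the square root or the scaling factor.
\[
\mathcal{L}_{\mathrm{SiLog}} = \alpha \sqrt{\frac{1}{N}\sum_{i=1}^{N} g_i^{2} \;-\; \lambda\Bigl(\frac{1}{N}\sum_{i=1}^{N} g_i\Bigr)^{2}} .
\]
With $\lambda = 1$ the loss is invariant to a global scaling of the predicted depth, whereas $\lambda = 0$ reduces it to a scaled RMSE in log space; intermediate values trade off the two behaviors.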



% Additionally, we employed two supervision signals to learn sharp metric depth maps from video streams. The metric loss slightly improved $\delta_1$ with trade-offs on other metrics, while the edge loss  further improved all metrics, particularly the distance metrics. Adding multi-level temporal modules reduces relative-scale (SqRel) and distance errors (RMSE and L1), but F1 decreases slightly. It can introduce flickering in far-range regions because the depth boundaries become too sharp (see Fig.~\ref{temporal_results}).

Additionally, we employ two supervision signals to learn sharp metric depth maps from video streams. The metric loss slightly improves $\delta_1$ with trade-offs on other metrics, while the edge loss further improves all metrics, particularly the distance metrics. However, the edge loss can introduce flickering in far-range regions because the depth boundaries become too sharp (see Fig.~\ref{temporal_results}). Adding multi-level temporal modules reduces relative-scale (SqRel) and distance errors (RMSE and L1), but F1 decreases slightly. This is because the propagated features are at a coarse scale and lack fine details. Without an explicit constraint to preserve these details, the decoded depth maps become over-smoothed, which reduces F1.


Furthermore, multi-scale supervision substantially improves $\delta_1$, AbsRel, SqRel, and RMSE log, as well as boundary sharpness (F1), with only minor trade-offs in RMSE and L1. Lastly, the proposed self-supervised temporal regularization improves most relative-scale and distance metrics, with minor decreases in RMSE log and $\delta_1$. These decreases are small compared to the improvements in the relative-scale and distance metrics, which better serve medical robotics applications by ensuring accurate local geometry and globally consistent scale. More importantly, this final version preserves detailed boundary information while maintaining temporal consistency (see Fig.~\ref{temporal_results}).





\paragraph{EST ablation study.} 
\input{tables/ets_table}

In Tab.~\ref{ets_table}, we compare different EST variants built on FlashDepth and report both overall and near-range performance (depth $<3\,\mathrm{mm}$). Photometric transformations alone degrade performance, whereas geometric transformations improve results across metrics. Importantly, the full EST yields the largest improvements in the near range, notably reducing $\mathrm{RMSE}_{\text{near}}$ from 2.789 to 1.715\,$\mathrm{mm}$, which indicates that the combined transformations better model near-range perturbations. This improvement is especially critical for surgical robotics, where accurate near-range depth supports safe instrument--tissue interaction. Since EST is a modality-agnostic stochastic transformation with no learned parameters, it can potentially be applied directly to other endoscopic imaging modalities.







\paragraph{Temporal analysis.}

\input{tables/mamba_ablation}

Tab.~\ref{tab:mamba_ablation} compares placements of the temporal Mamba modules across feature pyramid levels (L1 = finest, L4 = coarsest). Placing Mamba only at the mid levels (L2--L3) produces the worst F1 score due to missing local boundary information. Using only the finest level (L1) preserves detailed local context, as reflected in the L1 error and F1, but lacks global context, leading to higher AbsRel and the worst temporal stability (F-Var.). Progressively adding Mamba modules at coarser levels improves AbsRel and gradually improves RMSE and F1. Our full 4-level design captures complementary information across scales, i.e., coarse levels for global temporal context and fine levels for boundary sharpness, achieving the best RMSE, F1, and temporal stability while maintaining comparable performance for $\delta_1$ and L1. The improvements between the last two rows of Tab.~\ref{tab:mamba_ablation} indicate the effectiveness of adding the coarsest temporal Mamba module.



\paragraph{Window size analysis.}

\input{tables/window_size_analysis}

We analyze the effect of temporal window size on depth estimation quality. As shown in Tab.~\ref{tab:window_ablation}, a shorter window yields higher frame variance, likely due to limited temporal context for enforcing consistency. Increasing the window to 10 reduces frame variance but degrades boundary sharpness (lower F1). A window size of 5 provides the best overall trade-off, achieving the lowest frame variance and the highest F1.




\input{tables/benchmark}

\paragraph{Benchmarking.} To validate the effectiveness of our model, in Tab.~\ref{benchmark_table} we compare it against published results on two public depth estimation benchmarks, C3VD and SimCol3D. The results of our single-frame network with the proposed EST are comparable to those of the top-performing comparison methods. The proposed EndoStreamDepth achieves the best results on all metrics except AbsRel on C3VD. This suggests that our model generalizes well across endoscopic datasets. These comparisons also suggest that our methods, whether frame- or video-based, can serve as robust methods for endoscopy depth estimation.







\section{Conclusion}
In this paper, we propose EndoStreamDepth, a monocular depth estimation framework for endoscopic video that produces accurate, temporally consistent, and sharp depth maps while maintaining real-time throughput. Our method outperforms state-of-the-art baselines on public depth datasets. More importantly, EndoStreamDepth addresses a key limitation of current endoscopic video depth estimation methods in real-time applications, namely their reliance on batched multi-frame processing. By processing frames as a stream, EndoStreamDepth reduces latency and satisfies real-time requirements. Future work will extend the method to other applications in robot-assisted endoscopic intervention and further leverage proprioception \cite{jordan2025probemde} to improve performance in areas with relatively larger errors.



