\section{Introduction}



Video depth estimation in monocular endoscopy provides geometric information for downstream tasks, such as 3D reconstruction \cite{recasens2021endo}, and supports automation in image-guided and robot-assisted interventions. These applications require depth maps that are accurate, temporally consistent across frames, and available in real time (20 FPS as suggested in \cite{kasimieh2025deep}) so that they can be used reliably in the control loop of surgical robots.



Existing vision foundation models for monocular depth estimation have achieved state-of-the-art performance \cite{yang2024depthv2,Bochkovskii2024,chen2025video,hu2025depthcrafter,shao2025learning}. 
However, single-frame models may produce flickering and inconsistent depths when they are applied to a video sequence due to the lack of temporal information \cite{yang2024depthv2,Bochkovskii2024}. 
Video-based models \cite{chen2025video} require processing batches of frames at once for better temporal consistency, which increases latency and makes them less suitable for real-time applications. 
Diffusion-based depth estimation methods can provide high-quality depth predictions \cite{hu2025depthcrafter,shao2025learning}, but they also take batches of frames as input and are slow at inference because of the computational cost of iterative denoising. Although adapting these foundation models to endoscopic applications can achieve good performance \cite{tian2024endoomni,paruchuri2024leveraging,cui2024endodac,zhou2025endodav,hardy2025coloncrafter}, \uline{they still suffer from fundamental real-time limitations for video depth estimation, because this setting requires low latency as well as causal, frame-by-frame streaming without waiting for future frames}.



A recent work, FlashDepth~\cite{chou2025flashdepth}, adapts a large depth foundation model with a temporal Mamba~\cite{dao2024transformers} layer to enable real-time streaming depth estimation with strong performance. However, it is designed for natural indoor and outdoor scenes and does not account for the characteristics of endoscopic videos, where near-field lighting, specular reflections, sudden camera movement, rapid rotations, and motion blur cause large appearance changes. These factors limit its performance even after fine-tuning and lead to inaccurate depth predictions, particularly large near-range errors.

% Endoscopic videos from colonoscopy \cite{bobrow2023}, bronchoscopy \cite{tian2024endoomni}, and ureteroscopy \cite{lu2025kidney} often show a roughly circular lumen, but the underlying anatomy has very different physical diameters. A similar lumen size in the frame can therefore correspond to very different depth ranges across these procedures, and the lumen size also varies between patients and across anatomical regions.  

In addition, FlashDepth is supervised with only a single $\ell_1$ loss on metric depth, so far-range errors are penalized more strongly than near-range errors, since the same relative error yields a larger absolute residual at greater depth. Its temporal modeling capacity is also limited: it lacks temporal consistency regularization and relies on a single temporal Mamba module, so the predicted depth maps are not stable across frames and exhibit noticeable far-range flickering. Moreover, without an edge-aware supervision term, it fails to produce sharp depth maps in low-texture endoscopic frames.


To overcome these limitations, we propose \textbf{EndoStreamDepth}, a streaming monocular depth estimation framework for endoscopic video that produces accurate, temporally consistent, and sharp depth maps. It explicitly models endoscopy-specific geometric and photometric variations and uses comprehensive supervision to achieve robust performance. A hierarchical architecture with multi-level temporal modules that aggregate local and global information, together with a self-supervised temporal regularization term, leverages cross-frame information to maintain consistency throughout the video stream.




For real-world deployment, EndoStreamDepth processes individual frames sequentially, while the multi-level temporal modules maintain temporal information across arbitrarily long sequences. \ul{Our work addresses a key limitation of existing endoscopic video depth methods: they rely either on multi-frame batched input or on computationally intensive diffusion processing, both of which introduce significant latency}. Our main contributions are:

\begin{itemize}
    \item We introduce an Endoscopy-Specific Transformation (EST) that models typical geometric and photometric variations in endoscopic video and can be integrated into existing image- and video-based depth foundation models to improve robustness.

    \item We design a single-frame depth foundation model with a temporal module that leverages inter-frame information for real-time streaming video depth estimation, and we train it with comprehensive supervision to produce accurate, sharp depth maps. 


    
    \item We propose a hierarchical architecture with multi-level temporal modules, optimized with multi-scale supervision and self-supervised temporal consistency regularization, to produce accurate and temporally consistent depth maps for video streams.




\end{itemize}


Comprehensive experiments were conducted on publicly available endoscopy metric depth benchmarks,
% where the proposed EndoStreamDepth shows superior performance to state-of-the-art depth estimation methods. 
where our quantitative evaluation uses phantom and simulated data with ground-truth depth, and the proposed EndoStreamDepth outperforms state-of-the-art depth estimation methods.
To the best of our knowledge, this is the first work on streaming metric depth estimation for endoscopy that uses hierarchical temporal modules to handle arbitrarily long sequences. This work provides a robust, reproducible approach to streaming depth estimation in endoscopic videos.















