




\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{figures/framework.pdf}
 \caption{Overview of the \textbf{EndoStreamDepth} framework. 
(a) Endoscopy-specific Transformation (EST) is applied to model typical variations in endoscopy. 
(b) The single-frame depth network predicts depth map $\hat{D}_t$ from frame $I_t$. 
(c) The video stream depth network further incorporates Mamba modules that carry hidden states ($H_{t-1}$, $H_t$) from one frame to the next, propagating temporal information to improve depth predictions. Frames are processed sequentially (streaming), not simultaneously.}
\label{framework}
\end{figure}


% scale-invariant depth
% metric shape/scale
% boundary sharpness
% temporal smoothness


\section{Methods}
\textbf{EndoStreamDepth} (Fig.~\ref{framework}) targets monocular depth estimation from endoscopic video streams.
Given an endoscopic video $\{I_t\}_{t=1}^T$, where
$I_t \in \mathbb{R}^{H \times W \times 3}$ is the $t$-th RGB frame, it predicts a sequence
of depth maps $\{\hat{D}_t\}_{t=1}^T$, with $\hat{D}_t \in \mathbb{R}^{H \times W}$, that
are spatially accurate, temporally consistent, and sharp at boundaries, while remaining suitable for real-time processing. The framework consists of three components: (1) a single-frame depth backbone trained with endoscopy-specific transformations to improve robustness, (2) a streaming temporal module based on Mamba that propagates information across frames and is trained with multi-term supervision, and (3) a hierarchical multi-level design that refines depth from local details (edges and fine structures) to global geometry and adds a self-supervised temporal consistency loss to stabilize predictions over time.







\subsection{Single-frame depth network}
\label{single-frame}
We first build a single-frame network for accurate depth estimation from endoscopic images.
The network $f_\theta$ (Fig.~\ref{framework}(b)) consists of a Vision Transformer encoder \cite{oquab2023dinov2} and a DPT decoder \cite{ranftl2021vision}. To achieve accurate performance \cite{li2025monocular}, we adapt the pretrained weights of a large-scale monocular depth foundation model, specifically DepthAnythingv2 \cite{yang2024depthv2}. We use ViT-L with 24 transformer blocks as the backbone. The DPT decoder fuses multi-scale encoder features using residual blocks \cite{he2016deep} to predict the depth map $\hat{D}_t$ for a given frame $I_t$.
This single-frame depth network $f_\theta$ serves as the backbone for all subsequent temporal and hierarchical extensions.

Following DepthAnythingv2, we train $f_\theta$ using a scale-invariant logarithmic (SiLog) loss, which is defined as:
\begin{equation}
\label{SiLog}
% \mathcal{L}_{\text{si}}(t)
% = \sqrt{
%     \frac{1}{N} \sum_{i=1}^{N}
%         \big( \log D_t(i) - \log \hat{D}_t(i) \big)^2
%     - \lambda \left(
%         \frac{1}{N} \sum_{i=1}^{N}
%         \big( \log D_t(i) - \log \hat{D}_t(i) \big)
%     \right)^2
% }
\mathcal{L}_{\text{si}}(t)
= \sqrt{
    \frac{1}{N} \sum_{i=1}^{N}
        \big( \log D_t(i) - \log \hat{D}_t(i) \big)^2
    - 0.5 \left(
        \frac{1}{N} \sum_{i=1}^{N}
        \big( \log D_t(i) - \log \hat{D}_t(i) \big)
    \right)^2
}
\end{equation}
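where $i$ indexes the pixels with positive ground-truth depth in $D_t$ and $N$ is the number of such pixels.
Because the squared mean is subtracted with weight $0.5$ rather than $1$, the loss is only partially scale-invariant: it penalizes the variance of the log-depth residuals in full and their squared mean (bias) with half weight, and thus encourages both low variance and low bias.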
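
To see how the weight $0.5$ balances these two effects, the loss can be rewritten in terms of the log-depth residual. The symbols $r_t(i)$, $\mu_t$, and $\sigma_t^2$ below are introduced here only for illustration:
\[
r_t(i) = \log D_t(i) - \log \hat{D}_t(i), \qquad
\mu_t = \frac{1}{N}\sum_{i=1}^{N} r_t(i), \qquad
\sigma_t^2 = \frac{1}{N}\sum_{i=1}^{N} \big( r_t(i) - \mu_t \big)^2 .
\]
Since $\frac{1}{N}\sum_{i=1}^{N} r_t(i)^2 = \sigma_t^2 + \mu_t^2$, Eq.~\ref{SiLog} becomes
\[
\mathcal{L}_{\text{si}}(t) = \sqrt{\sigma_t^2 + 0.5\,\mu_t^2} .
\]
Rescaling the prediction as $\hat{D}_t \mapsto s\,\hat{D}_t$ with $s > 0$ shifts every residual by $-\log s$, which leaves $\sigma_t^2$ unchanged and alters only $\mu_t$. The variance term is therefore invariant to a global scale factor, while the bias term penalizes a global scale error with half weight.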



% $N$ is the number of such pixels, and we set $\lambda = 0.5$ in our experiments. This loss is invariant to a global scale factor in depth and encourages both low variance and low bias in the log-depth residuals.



\paragraph{Endoscopy-Specific Transformation (EST).}
Existing depth foundation models \cite{yang2024depthv1,tian2024endoomni} mainly rely on large-scale training data, and do not explicitly consider task-specific variations. For endoscopic depth estimation, task-specific augmentation can be more helpful, but this is rarely discussed in existing work.

% this can be expanded in the appendix (for example, Depth Anything \cite{yang2024depthv1} only uses horizontal flipping, and EndoOmni applies color jittering and Gaussian blur)


To increase the robustness of the depth network, we design a simple yet effective endoscopy-specific augmentation pipeline that combines geometric transformations (random $90^\circ$ rotations, horizontal and vertical flips) with photometric perturbations (blur, defocus, brightness/contrast, gamma correction, fog). Examples are shown in Fig.~\ref{EST}. These transformations are applied to the training data before it is fed into the model (Fig.~\ref{framework}(a)): per frame for frame-based methods and per window sequence for video-based methods.


In clinical procedures, the endoscopic camera is often rotated. Because the field of view of the endoscope is approximately symmetric, the same anatomy may appear in different orientations. The geometric transformations therefore make the network less sensitive to such viewpoint changes, which are typical in endoscopic applications. The photometric perturbations simulate common endoscopic artifacts that frequently occur during surgery, including motion blur, specular reflections from illumination, and smoke and fogging caused by tissue cutting. Further details are given in Appendix~\ref{EST_illustration}.









\subsection{Video stream depth network}
Building on the baseline $f_\theta$ (Sec.~\ref{single-frame}) and inspired by the recent FlashDepth work \cite{chou2025flashdepth}, we extend $f_\theta$ to a video stream depth network for endoscopy (Fig.~\ref{framework}(c)). The network operates in a streaming mode. At time step $t$, the current frame $I_t$ is passed through $f_\theta$ to obtain a feature map. This feature map, together with the hidden state $H_{t-1}$ from the previous step, is then fed into a Mamba module. The Mamba module updates the hidden state and produces a refined feature map for $I_t$, which the output head decodes into the depth map $\hat{D}_t$. The updated hidden state $H_t$ is stored and used at the next time step $t+1$. Over time, this recurrence propagates temporal information across frames and yields a sequence of depth predictions $\{\hat{D}_t\}$ with improved temporal consistency.
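
The streaming update described above can be summarized by the following recurrence. This is a schematic sketch only: the symbols $F_t$, $\tilde{F}_t$, $\mathrm{Mamba}(\cdot)$, and $\mathrm{Head}(\cdot)$ are illustrative names introduced here, and the initialization $H_0$ of the hidden state is not specified in the text and is left open:
\[
F_t = f_\theta(I_t), \qquad
(\tilde{F}_t,\, H_t) = \mathrm{Mamba}(F_t,\, H_{t-1}), \qquad
\hat{D}_t = \mathrm{Head}(\tilde{F}_t), \qquad t = 1,\dots,T .
\]
Here $f_\theta(I_t)$ denotes the feature map extracted by $f_\theta$ rather than its final depth output. Under this formulation, each step depends only on the current frame and the stored state $H_{t-1}$, so temporal context is carried by the hidden state alone.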


FlashDepth uses only an $\ell_1$ loss on metric depth.
For the same relative error, this loss penalizes far-range pixels more strongly than near-range pixels, and it constrains the global depth structure more weakly than the SiLog loss
(Eq.~\ref{SiLog}). In endoscopic video, clinically relevant anatomy often lies at near and mid range
and requires accurate local geometry and clear anatomical boundaries. We therefore combine the SiLog loss with two auxiliary log-depth losses that emphasize near-range accuracy and boundary sharpness, as described below.





\paragraph{Learning metric depth.}
In addition to the temporal module, we supervise $\hat{D}_t$ using a log-domain $\ell_1$ loss defined on the ground truth depth map $D_t$, given by:

\begin{equation}
\label{l1_loss}
\mathcal{L}_{\text{metric}}(t)
= \frac{1}{N} \sum_{i=1}^{N}
\left|
    \log D_t(i)
    - \log\hat{D}_t(i)
\right|
\end{equation}
Since $D_t$ is given in physical units (millimeters), minimizing $\mathcal{L}_{\text{metric}}$ encourages both correct metric scale and a coherent global depth shape. Applying a log transform makes the error relative: the loss penalizes $|\log(D_t/\hat{D}_t)|$ rather than raw differences. This compresses the depth range and reduces the influence of far-range pixels on the supervision, while maintaining accuracy in the near-range region.
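
As a small worked example of this effect (the numbers are illustrative and not taken from any dataset, and the natural logarithm is assumed), consider two pixels with ground-truth depths of $10$\,mm and $100$\,mm whose predictions are $11$\,mm and $110$\,mm, i.e., a $10\%$ overestimate in both cases. The raw and log-domain absolute errors are
\[
|10 - 11| = 1\,\text{mm}, \qquad
|100 - 110| = 10\,\text{mm}, \qquad
|\log 10 - \log 11| = |\log 100 - \log 110| = \log 1.1 \approx 0.095 .
\]
A raw $\ell_1$ loss weights the far pixel ten times more heavily, whereas the log-domain loss in Eq.~\ref{l1_loss} treats both errors equally.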




\paragraph{Learning sharp edges.}
To sharpen anatomical boundaries, we use an edge-aware loss that emphasizes discrepancies in depth gradients around structure boundaries to preserve thin structures such as lumen borders. Let $G_x(\cdot)$ and $G_y(\cdot)$ denote forward finite differences in the horizontal and vertical directions. The gradient-based edge loss is defined as:
\begin{equation}
\label{edge_loss}
\mathcal{L}_{\text{edge}}(t)
= \frac{1}{N} \sum_{i=1}^{N}
\left(
    \big| G_x(\log D_t)(i) - G_x(\log \hat{D}_t)(i) \big|
    + \big| G_y(\log D_t)(i) - G_y(\log \hat{D}_t)(i) \big|
\right)
\end{equation}
This loss penalizes differences between the log-depth gradients of the prediction and the ground truth, and thus yields depth maps with sharper, more anatomically aligned boundaries.
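
For concreteness, one possible convention for the finite differences is sketched below. The pixel coordinates $(u,v)$ and the generic map $Z$ are notation introduced here for illustration, and the handling of the last row and column is an assumption left unspecified:
\[
G_x(Z)(u,v) = Z(u+1,v) - Z(u,v), \qquad
G_y(Z)(u,v) = Z(u,v+1) - Z(u,v) .
\]
Applied to log-depth, each difference becomes a log-ratio of neighboring depths,
\[
G_x(\log D_t)(u,v) = \log \frac{D_t(u+1,v)}{D_t(u,v)} ,
\]
so a global rescaling $\hat{D}_t \mapsto s\,\hat{D}_t$ with $s > 0$ cancels in the ratio and leaves $\mathcal{L}_{\text{edge}}$ unchanged. Under this convention, the edge term constrains only relative depth discontinuities, and the metric scale must be supervised by the other loss terms.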





\begin{figure}[t]
\centering
\includegraphics[width=0.86\linewidth]{figures/multi_level_new.pdf}
\caption{
Multi-level temporal Mamba integration within the decoder.
For each decoder level $l$, the feature tokens of the current frame $I_t$ are fused with the features from level $l-1$ and passed through a Mamba module
that receives the hidden state $H^{(l)}_{t-1}$ as temporal context. The module outputs an updated hidden state $H^{(l)}_{t}$, which is propagated
to the next frame $I_{t+1}$. The right panel illustrates a single Mamba module implemented
as a stack of Mamba blocks with state-space model (SSM) layers, each maintaining a recurrent hidden state $h_t$ that is updated to $h_{t+1}$ at the next time step. For brevity, we denote these internal SSM states by $h_t$ without block indices. They are distinct from the decoder-level states
$H_t^{(l)}$, and each SSM layer passes its own hidden state to the corresponding layer at the next time step. The decoder processing for $I_{t-1}$ and $I_{t+1}$ is identical to that for $I_t$ and is omitted.
}


\label{multi_level}
\end{figure}





\subsection{Hierarchical multi-level architecture and supervision}
The final component of EndoStreamDepth is a hierarchical multi-level design (Fig.~\ref{multi_level}) that combines temporal Mamba modeling and supervision across scales. Instead of predicting depth at a single resolution, the network produces a pyramid of depth maps and temporal features, which allows it to capture both local fine details and global geometry consistently.





\paragraph{Multi-level temporal consistency.}
We extend the temporal branch (Fig.~\ref{framework}(c)) to four feature
levels by applying Mamba modules at multiple spatial resolutions (Fig.~\ref{multi_level}). Let $l \in \{1,\dots,4\}$ index these levels, from finest ($l=1$) to coarsest ($l=4$). At each level $l$, the Mamba module
contains four Mamba blocks $b\in \{1,\dots,4\}$, and each block has a single state-space model
(SSM) layer that processes the temporal sequence. This yields, at time $t$,
a set of block-wise hidden states $\{h_t^{b,(l)}\}_{b=1}^4$. For brevity,
we denote this multi-block hidden state by $H_t^{(l)} = \{h_t^{b,(l)}\}_{b=1}^4$ and use
$h_t$ to refer to a generic block-wise state when indices are omitted.
This multi-level temporal modeling allows the network to exploit temporal
information at both coarse and fine feature levels, reduces flickering, and provides temporally coherent features.




\paragraph{Multi-scale deep supervision.}
As shown in Fig.~\ref{multi_level}, the hierarchical decoder produces a pyramid of depth predictions
$\{\hat{D}_t^{(l)}\}_{l=1}^{4}$. The ground-truth depth map $D_t$ is
downsampled to these resolutions for supervision, yielding
$\{D_t^{(l)}\}_{l=1}^{4}$. At each level $l$ we apply the same SiLog loss
(Eq.~\ref{SiLog}) to constrain the overall depth shape. Denoting by $\mathcal{L}_{\text{si}}^{(l)}(t)$ the
SiLog loss computed between $\hat{D}_t^{(l)}$ and $D_t^{(l)}$, the
multi-scale deep supervision is defined as:
\begin{equation}
\mathcal{L}_{\text{ms}}(t)
= \sum_{l=1}^{4} w_l \, \mathcal{L}_{\text{si}}^{(l)}(t)
\end{equation}
where $w_l$ is the weight of level $l$. We set $w_l = 1$ for all levels, so that all scales contribute equally.
Supervising intermediate scales in this way stabilizes training and promotes globally consistent depth across the pyramid.






\paragraph{Self-supervised temporal regularization.}
% Multi-scale supervision alone could be biased by interpolated ground truth
% depths at coarse scales. 
To maintain accuracy and temporal consistency, we introduce an additional temporal
regularization loss on the
finest-scale prediction $\hat{D}_t^{(1)}$. It penalizes frame-to-frame depth
fluctuations after a per-video normalization, so that depth
trajectories over time become smoother while spatial details are
preserved. For notational simplicity and consistency with Eqs.~(\ref{SiLog})--(\ref{edge_loss}), we denote the finest-scale prediction by $\hat{D}_t$.

For a given video with predictions $\{\hat{D}_t\}_{t=1}^T$, we first compute
a robust per-video normalization. All predicted depths $\hat{D}_t(i)$ across
frames and valid pixels are collected to compute a single median $m$ and the mean
absolute deviation from that median,
$a = \frac{1}{TN} \sum_{t=1}^T \sum_{i=1}^N |\hat{D}_t(i) - m|$.
The normalized depth is then given by
$\bar{D}_t(i) = (\hat{D}_t(i) - m)/a$, which standardizes the depth scale
across videos without altering the spatial structure of the depth maps.

% The temporal regularization loss is defined as:
% \begin{equation}
% \mathcal{L}_{\text{temp}}(t)
% = \frac{1}{N} \sum_{i=1}^N \bigl| \bar{D}_t(i) - \bar{D}_{t-1}(i) \bigr|,
% \end{equation}
% which directly penalizes frame-to-frame fluctuations in the normalized
% depth and encourages temporally stable predictions.


% The temporal regularization loss is then defined as:
% \begin{equation}
% \mathcal{L}_{\text{temp}}
% = \frac{1}{(T-1)N} \sum_{t=1}^{T-1} \sum_{i=1}^N
% \big| \bar{D}_{t+1}(i) - \bar{D}_t(i) \big|
% \end{equation}
% and it is averaged over training videos. This regularizer penalizes inconsistent depth changes across frames while being invariant to per-video scale and offset, resulting in temporally smoother yet spatially detailed depth predictions.


The self-supervised temporal regularization loss is defined over a windowed video as:
\begin{equation}
\label{temp_loss}
\mathcal{L}_{\text{temp}}
= \frac{1}{(T-1)N} \sum_{t=1}^{T-1} \sum_{i=1}^N
\big| \bar{D}_{t+1}(i) - \bar{D}_t(i) \big|,
\end{equation}
which penalizes inconsistent depth changes between frames while
remaining invariant to per-video scale and offset, resulting in temporally
smoother yet spatially detailed depths.
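
The stated invariance can be checked directly. Suppose all predictions of a video are transformed by a hypothetical global scale $\alpha > 0$ and offset $\beta$, i.e., $\hat{D}'_t(i) = \alpha \hat{D}_t(i) + \beta$. The median and the mean absolute deviation then transform as $m' = \alpha m + \beta$ and $a' = \alpha a$, so that
\[
\bar{D}'_t(i)
= \frac{\alpha \hat{D}_t(i) + \beta - (\alpha m + \beta)}{\alpha a}
= \frac{\hat{D}_t(i) - m}{a}
= \bar{D}_t(i),
\]
and $\mathcal{L}_{\text{temp}}$ is unchanged. The normalization assumes $a > 0$; how a degenerate video with constant predicted depth is handled (for example, with a small additive constant in the denominator) is not stated here and would be an implementation choice.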




\paragraph{Total objective function.}
For each training video, EndoStreamDepth is optimized using four
complementary loss terms (Eqs.~(\ref{SiLog})--(\ref{temp_loss})).
The total training objective is:

\begin{equation}
\mathcal{L}_{\text{total}}
= \frac{1}{T} \sum_{t=1}^{T}
\big(
    \mathcal{L}_{\text{ms}}(t)
  + \mathcal{L}_{\text{metric}}(t)
  + \mathcal{L}_{\text{edge}}(t)
\big)
+ \ 0.01 \, \mathcal{L}_{\text{temp}}
\end{equation}
This objective jointly enforces accurate metric depth, sharp boundaries, multi-scale consistency, and temporal coherence, all of which are crucial for endoscopic video depth estimation.




