\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{tikz}
\usetikzlibrary{spy}
\usepackage{multirow}
\usepackage{float}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{array}
\newcolumntype{?}{!{\vrule width 1pt}}


\input{newcommands}

\jmlryear{2026}\jmlrworkshop{Full Paper -- MIDL 2026}\jmlrvolume{-- nnn}\editors{Accepted for publication at MIDL 2026}

\title[Endo-4DTS]{Endo-4DTS: Monocular 4D Scene Synthesis for Endoscopy via Deformable Triangle Splatting}

\midlauthor{\Name{Laura Salort-Benejam} \orcid{0009-0000-6864-0854} \Email{laura.salort@upc.edu}\\
\Name{Antonio Agudo}\orcid{0000-0001-6845-4998} \Email{antonio.agudo@upc.edu}\\
\addr Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain
}

\begin{document}

\maketitle

\begin{abstract}
Endoscopy is an essential procedure in medical imaging, routinely applied for diagnostic, prognostic and therapeutic purposes. Developing robust methods for 3D reconstruction of endoscopic videos has the potential to improve the visualization of complex anatomies, increase diagnostic accuracy, and guide surgical procedures. Despite recent advancements, the task remains highly challenging. The deformable nature of soft tissues makes classical computer-vision algorithms unreliable, and additional difficulties arise from the widespread use of monocular cameras, unknown camera parameters, occlusions, illumination changes, motion blur and other artifacts. In this work, we present Endo-4DTS, a novel self-supervised pipeline based on triangle splatting for 4D scene synthesis of deformable endoscopy scenes from monocular videos with a static camera. To the best of our knowledge, this is the first time this type of solution has been proposed for endoscopic images and time-varying tissues. Our approach represents the endoscopic environment with a canonical set of triangles, optimized jointly with a deformation network, enabling consistent 4D synthesis of dynamic tissues. We incorporate additional geometric and depth-based objectives that further guide learning in the challenging context of deformable endoscopic scenes. Experiments on several endoscopic videos with non-rigid tissues, occlusions and illumination changes show that Endo-4DTS reliably captures soft-tissue deformations in endoscopic scenes. We demonstrate that Endo-4DTS consistently outperforms previous state-of-the-art methods across multiple metrics.
\end{abstract}


\begin{keywords}
Triangle splatting, differentiable rendering, endoscopy, non-rigid tissues.
\end{keywords}


\section{Introduction}

Endoscopy has become an essential medical imaging modality for examining the human body across a wide range of interventions, for diagnostic, prognostic and therapeutic purposes. Despite the existence of stereo and RGB-D cameras that provide additional depth information, their larger size requires bigger incisions and limits clinical applicability. As a result, monocular cameras, with their compact design and versatility, remain the most widely used in endoscopic devices today~\cite{stereo_not_common,Endo_tech_today}. The wide variety of possible endoscopic interventions--such as colonoscopies, bronchoscopies or arthroscopies--presents a major challenge for recovering 3D information from visual cues, as the models must be able to adapt to significant changes in anatomy, appearance, illumination, camera motion and tissue deformation. Consequently, there is a strong need for generic, robust models capable of interpreting visual endoscopic data across different anatomies and deformation patterns. These would benefit both patients and clinicians by enabling 3D visualization of anatomical structures, facilitating the assessment of regions that are difficult to inspect due to restricted viewpoints, and allowing patient-specific reconstructions to be revisited during follow-up examinations to monitor disease progression.

However, recovering 3D information from monocular videos is fundamentally challenging. Classical methods--including rigid and non-rigid Structure-from-Motion (SfM)~\cite{AgarwalICCV2009,schoenberger2016sfm,AgudoPAMI2016,agudoICPR2020,MCPD}, shape-from-template~\cite{defslam}, shape-from-shading~\cite{LightDepth}, photometric stereo~\cite{photometric-stereo}, and supervised approaches~\cite{bimodal_camera_pose}--typically depend on explicit correspondences or strong priors on motion and deformation to solve the problem. These assumptions are rarely satisfied in endoscopy, where biological tissues undergo non-rigid deformations, making 3D reconstruction inherently ill-posed. Moreover, endoscopic videos introduce additional challenges that can decrease the robustness of the shape and camera estimates of traditional algorithms, such as specular highlights caused by the light source, abrupt camera movements, limited and highly constrained viewpoints, occlusions from tools or fluid, low-texture regions, motion blur, and illumination changes.

Recent advances in neural rendering have greatly improved 3D reconstruction under challenging imaging conditions. Neural Radiance Fields (NeRF)~\cite{original-NERF} introduced a powerful implicit representation that enables high-quality novel-view synthesis through differentiable volume rendering. Despite numerous extensions for pose estimation~\cite{SCNeRF2021,bundlesdfwen2023}, acceleration~\cite{instant-npg15,fridovich2022plenoxels}, and dynamic scenes~\cite{d-nerf,nerfies,4DPV}, NeRF remains limited by its reliance on accurate camera poses, heavy computation times, and difficulty handling uncontrolled lighting and non-rigid motion. Early adaptations to endoscopy settings such as EndoNeRF~\cite{endoNERF} and follow-up variants~\cite{endoSURF,lerplane,ortho-neural-plane,salortISBI26} assume fixed cameras and depend on stereo or auxiliary depth cues to solve the 3D reconstruction problem on deformable scenes, limiting their applicability to many endoscopy procedures where such information is not available.

The recent emergence of 3D Gaussian Splatting (3DGS)~\cite{gaussiansplatting} offers a more efficient explicit representation of the scene through a set of learnable anisotropic 3D Gaussians, enabling real-time rendering with significantly faster optimization. However, its formulation assumes rigid scenes and requires SfM-based camera calibration and a sparse point cloud for initialization, which is infeasible in endoscopy settings because SfM fails on non-rigid scenes. Several works have extended the original formulation to deformable environments~\cite{dynamic3dgs,spacetime-gs,deformable3dgs,gaussian-flow,motiongs,per-gaussian}, typically by using a neural network to model deformations. Others have adapted it to endoscopy~\cite{endogaussian,endo4dgs,endogs,surgicalgaussian,deform3dgs}; however, they still rely on rigid-camera assumptions, SfM, or external depth estimation. Meanwhile, works on camera estimation for endoscopy~\cite{pancakes,endogslam} do not explicitly model tissue deformation.

One of the latest works on 3D scene reconstruction and differentiable rendering is triangle splatting~\cite{trianglesplatting26}, which proposes using triangles as primitives for efficient, high-quality scene representation. By using a set of unstructured, disconnected triangles, this approach leverages the latest advancements in computer graphics for GPU-accelerated triangle processing and incorporates them into a fully differentiable pipeline. It has shown remarkable results in visual fidelity, training and rendering speed, and novel view synthesis quality, but it is only intended for rigid scenes with known camera parameters, usually calibrated with SfM, and also requires the resulting sparse point cloud for initialization.

In this work, we propose Endo-4DTS to extend the {\em triangle splatting} approach to deformable scenes, especially those in endoscopic procedures, without relying on additional priors such as known camera calibration, pre-computed templates or feature-based approaches for initialization. To the best of our knowledge, this work is the first to adapt triangle splatting to deformable scenes, and the first to apply it to endoscopy.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Method}
Our work builds on triangle splatting~\cite{trianglesplatting26}, which represents a 3D scene using learnable triangle primitives, similar to 3DGS~\cite{gaussiansplatting} but replacing the Gaussians with triangles. Each triangle is defined by three vertices $\textbf{v}_i \in \mathbb{R}^3$, a color $\textbf{c}$ represented with Spherical Harmonics, a smoothness parameter $\sigma$, and an opacity $o$. Triangles are initialized from a sparse point cloud obtained via SfM and refined through an adaptive densification and pruning strategy as in~\cite{MCMC}.

Triangle splatting~\cite{trianglesplatting26} first projects the 3D triangles to the image plane and then computes the final pixel values with point-based rendering, making it less computationally demanding than NeRF-based approaches. This projection of the 3D triangles to image space is done using a standard pinhole camera model $\overline{\textbf{p}}_i=\textbf{K}(\textbf{R}_{cam}\textbf{v}_i+\textbf{t}_{cam})$, where $\overline{\textbf{p}}_i\in \mathbb{R}^3$ is the homogeneous coordinate in image space of the projected triangle vertex $\textbf{v}_i \in \mathbb{R}^3$, $\textbf{p}_i\in \mathbb{R}^2$ is the same pixel coordinate in Euclidean space, $\textbf{K}$ is a 3$\times$3 known intrinsic camera matrix, and $\textbf{R}_{cam}$ and $\textbf{t}_{cam}$ are the rotation and translation, respectively, that define the camera pose. 
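For reference, the Euclidean pixel coordinate follows from the homogeneous one by the usual perspective division. Writing $\overline{\textbf{p}}_i=[\overline{x}_i,\ \overline{y}_i,\ \overline{z}_i]^{\top}$ (component names introduced here only for illustration) and assuming $\overline{z}_i>0$, i.e., a vertex in front of the camera:
\begin{equation*}
\textbf{p}_i=\frac{1}{\overline{z}_i}\begin{bmatrix}\overline{x}_i\\ \overline{y}_i\end{bmatrix}.
\end{equation*}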
The projected triangles are then sorted by distance to the camera and the color $\bzeta$ for each pixel $\textbf{p}$ is obtained by accumulating the contribution of each overlapping triangle, as defined by the rendering equation in previous works~\cite{gaussiansplatting,convex}:
\begin{equation}
\label{eq:point-rendering}
\bzeta(\textbf{p})=\sum_{n=1}^{N_T}\textbf{c}_no_n\mathcal{W}(\textbf{p})\prod_{j=1}^{n-1}\bigl(1-o_j\mathcal{W}(\textbf{p})\bigr),
\end{equation}
where $N_T$ is the total number of triangles, $\textbf{c}_n$ is the learned color of the \textit{n}-th triangle, and $\mathcal{W}(\textbf{p})$ is a window function that smoothly influences the contribution of each projected triangle based on the distance from the triangle's incenter $\textbf{s}_{inc} \in \mathbb{R}^2$, that is defined as:
\begin{equation}\label{eq:window}
    \mathcal{W}(\textbf{p})=\text{ReLU}\biggl( \frac{\rho(\textbf{p})}{\rho(\textbf{s}_{inc})} \biggr)^\sigma.
\end{equation}

Here, the learnable parameter $\sigma>0$ controls the sharpness of the window function over $\rho$, the Signed Distance Field (SDF) of the triangle in image space. Smaller values of $\sigma$ produce sharper boundary transitions, while larger values result in smoother transitions from the triangle boundary towards the incenter $\textbf{s}_{inc}$, where $\mathcal{W}(\textbf{p})$ attains its maximum value. The SDF is expressed as:

\begin{equation}\label{eq:sdf}
    \rho(\textbf{p})=\max_{i\in \{1,2,3\}} \textbf{n}_i\cdot\textbf{p}+d_i^{SDF},
\end{equation} 
where $\textbf{n}_i$ are the outward-pointing unit normals of the triangle edges and $d_i^{SDF}$ are the offsets that make the triangle boundary the zero-level set of $\rho$.
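
As a quick sanity check of \equationref{eq:window} and \equationref{eq:sdf}, note that with outward-pointing unit normals $\rho$ is negative inside the triangle, zero on its edges and positive outside. At the incenter all three signed edge distances coincide, so $\rho(\textbf{s}_{inc})=-r$, where $r$ denotes the inradius of the projected triangle (a symbol introduced here only for illustration). It follows that
\begin{equation*}
\mathcal{W}(\textbf{p})=
\begin{cases}
\bigl(|\rho(\textbf{p})|/r\bigr)^{\sigma}\in(0,1], & \rho(\textbf{p})<0 \ \text{(inside)},\\
0, & \rho(\textbf{p})\geq 0 \ \text{(on the boundary or outside)},
\end{cases}
\qquad \mathcal{W}(\textbf{s}_{inc})=1,
\end{equation*}
so each triangle contributes only to pixels inside its projected footprint.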

In this work, we propose Endo-4DTS, a framework that extends triangle splatting~\cite{trianglesplatting26} for 3D estimation of deformable scenes. Our approach introduces a canonical space representation in which a static set of 3D triangles is jointly optimized with a deformation network that estimates the rotation, translation, scale and color changes of these canonical triangles to the deformed space, see the proposed pipeline in \figureref{fig:model-overview}. Several additional losses are used to improve the photometric and geometric properties of the triangle representation.

Given a monocular input video containing deformable tissues, our model takes as input $\bigl\{(\textbf{I}_m,\textbf{D}_m, t_m)_{m=1}^M, \textbf{P},f\bigr\}$, where $M$ is the total number of frames. Specifically, $\textbf{I}_m \in \mathbb{R}^{H \times W \times 3}$ represents the \textit{m}-th monocular RGB video frame, with height $H$ and width $W$, $\textbf{D}_m \in \mathbb{R}^{H \times W}$ is the corresponding estimated monocular depth map, $t_m = m/M \in [0,1]$ denotes the normalized time index of the \textit{m}-th frame, $\textbf{P} \in \mathbb{R}^{4\times4}$ represents the static camera pose, and $f$ is the focal length of the camera used to compose the intrinsic camera matrix $\textbf{K}$. Our model assumes a fixed camera pose across time, which is reasonable for many endoscopic procedures in which the endoscope is intentionally kept static to maintain a stable field of view and allow precise tissue manipulation by the physician. Camera motion may occur in gastroscopies and exploratory procedures; however, our formulation could easily incorporate per-frame extrinsic information, such as $\textbf{P}_m \in \mathbb{R}^{4\times4}$, when available.
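
As a minimal sketch of how $\textbf{K}$ may be composed from $f$, one can assume square pixels, zero skew and a principal point $(c_x,c_y)$ at the image center. These are assumptions made here for illustration only; the text above states solely that $f$ is used to compose $\textbf{K}$:
\begin{equation*}
\textbf{K}=\begin{bmatrix} f & 0 & c_x\\ 0 & f & c_y\\ 0 & 0 & 1\end{bmatrix},
\qquad (c_x,c_y)=\Bigl(\tfrac{W}{2},\tfrac{H}{2}\Bigr).
\end{equation*}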


\begin{figure}[t!] 
\floatconts{fig:model-overview}
    {\caption{\textbf{Overview of our Endo-4DTS} pipeline for deformable triangle splatting from monocular endoscopy videos with static camera. Our method decomposes the structure information in canonical and deformed spaces, to capture the rigid and non-rigid contributions, respectively.}}
    {\centering
      \includegraphics[width=0.9\linewidth]{images/TRIANGLE INITIALIZATION (2).png}}
\end{figure}


To make our method applicable to a wider set of videos, we avoid the need for stereo-depth inputs by using Video Depth Anything~\cite{video_depth_anything}, a pre-trained monocular depth estimation model specifically designed for temporal consistency in videos, as pseudo-ground truth for our depth regularization.

\subsection{Deformation network}
The proposed deformation network $G_\Phi$ consists of an 8-layer Multilayer Perceptron (MLP) that takes as input the position of the triangles in the canonical space $\textbf{v}_i$ and the time $t_m$ of the current frame. Following~\cite{original-NERF}, separate positional encoding $\eta$ of the inputs is applied to avoid overly smoothed representations, encouraging the MLP to learn high-frequency functions to represent the scene.
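
For illustration, and assuming the encoding is used in the original form of~\cite{original-NERF}, each scalar input $x$ is mapped to
\begin{equation*}
\eta(x)=\bigl(\sin(2^{0}\pi x),\ \cos(2^{0}\pi x),\ \dots,\ \sin(2^{L-1}\pi x),\ \cos(2^{L-1}\pi x)\bigr),
\end{equation*}
applied independently to every coordinate of $\textbf{v}_i$ and to $t_m$. The number of frequency bands $L$ is a symbol introduced here; the values used for the spatial and temporal inputs are not specified in this section and may differ between the two.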

The deformation network then predicts, for each triangle, a rotation $\textbf{e} = (\theta,\psi,\phi)$, encoded using Euler angles, a translation $\delta\textbf{v} \in \mathbb{R}^3$, a scaling offset $\delta s\in \mathbb{R}$, and a new RGB color $\textbf{c}' \in\mathbb{R}^3$:
\begin{equation}
\bigl(\textbf{e},\delta\textbf{v},\delta s,\textbf{c}'\bigr)  = G_\Phi\Bigl(\eta \bigl([\textbf{v}_1,\textbf{v}_2,\textbf{v}_3]\bigr),\eta(t_m)\Bigr).
\end{equation} 

Following the pitch-roll-yaw convention, we obtain the final rotation matrix $\textbf{R} \in SO(3)$, define the homogeneous transform $\textbf{Q}=\begin{bmatrix}
        \textbf{R}&\delta\textbf{v}\\
        \textbf{0} & 1
    \end{bmatrix} \in SE(3)$, and apply it to the triangle vertices in the canonical space to obtain their transformed positions as $
\textbf{v}_i'=\textbf{Q}[\textbf{v}_i^{\top}\ 1]^{\top} = \textbf{R}\textbf{v}_i+\delta\textbf{v}$.
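
The axis assignment and multiplication order of the Euler angles are not stated above. As a sketch, assume that $\theta$, $\psi$ and $\phi$ are rotations about the $x$, $y$ and $z$ axes, composed as $\textbf{R}=\textbf{R}_z(\phi)\,\textbf{R}_y(\psi)\,\textbf{R}_x(\theta)$ (one common choice, not necessarily the one implemented). The elemental matrices are then
\begin{gather*}
\textbf{R}_x(\theta)=\begin{bmatrix}1&0&0\\0&\cos\theta&-\sin\theta\\0&\sin\theta&\cos\theta\end{bmatrix},\qquad
\textbf{R}_y(\psi)=\begin{bmatrix}\cos\psi&0&\sin\psi\\0&1&0\\-\sin\psi&0&\cos\psi\end{bmatrix},\\
\textbf{R}_z(\phi)=\begin{bmatrix}\cos\phi&-\sin\phi&0\\\sin\phi&\cos\phi&0\\0&0&1\end{bmatrix}.
\end{gather*}
Any such product of elemental rotations is orthogonal with unit determinant, so $\textbf{R}\in SO(3)$ holds for every predicted $\textbf{e}$, and $\textbf{e}=\textbf{0}$, $\delta\textbf{v}=\textbf{0}$ leaves the canonical vertices unchanged.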


\subsection{Optimization}

The final loss consists of two groups of terms: those used to optimize the canonical scene, denoted by the superscript $^C$, and those applied to the deformation network, indicated by the superscript $^D$:
\begin{equation}\label{eq:final-loss}
    \begin{split}
        \mathcal{L} = &\, (1-\lambda_1)\mathcal{L}_1^C 
        +\lambda_1 \mathcal{L}_{D-SSIM}^C
        +\lambda_2\mathcal{L}_o^C
        +\lambda_3\mathcal{L}_\textbf{N}^C
        +\lambda_4\mathcal{L}_s^C 
        +\lambda_5\mathcal{L}_{depth}^C
        +\lambda_6\mathcal{L}_{smooth}^C \\
        &+(1-\lambda_7)\mathcal{L}_1^D 
        + \lambda_7 \mathcal{L}_{D-SSIM}^D
        +\lambda_8\mathcal{L}_\textbf{N}^D 
        +\lambda_9\mathcal{L}_s^D        
        +\lambda_{10} \mathcal{L}_{pos}^D
        +\lambda_{11}\mathcal{L}_{rot}^D
        +\lambda_{12}\mathcal{L}_{\delta\textbf{v}}^D ,
    \end{split}
\end{equation}
where $\lambda_1,\dots,\lambda_{12}$ are weighting factors. Next, we define each term of the global loss.

\paragraph{Color loss:} Photometric fidelity is enforced, in both canonical and deformed spaces, using $\mathcal{L}_1$, the $l_1$-norm loss between the input and rendered images, and a Differential Structural Similarity Index Measure (D-SSIM) loss $\mathcal{L}_{D-SSIM}=1-\text{SSIM}(\textbf{x},\textbf{y})$, where $\textbf{x}$ and $\textbf{y}$ denote the rendered and input images.

\paragraph{Opacity loss $\mathcal{L}_o^C$:} Following~\cite{MCMC}, opacity is regularized to avoid overly transparent or saturated triangles in the canonical space: $\mathcal{L}_o^C=\frac{1}{N_T}\sum_{n=1}^{N_T}|o_n|$, where $o_n$ is the opacity of the \textit{n}-th triangle and $N_T$ is the total number of triangles.

\paragraph{Size loss $\mathcal{L}_s^C$ and $\mathcal{L}_s^D$:} Small or degenerate triangles are penalized in the canonical and transformed spaces via the size regularization $\mathcal{L}_s=2\|(\textbf{v}_2-\textbf{v}_1)\times(\textbf{v}_3-\textbf{v}_1)\|_2^{-1}$, i.e., the inverse of the triangle area.

\paragraph{Depth loss $\mathcal{L}_{depth}^C$:} We incorporate a depth loss term into the canonical representation to encourage the triangle soup to stay close to the pseudo-ground truth surface. It is defined as the $l_1$-norm loss between the estimated depth map $\hat{\textbf{D}}$ and the input depth map $\textbf{D}$:
\begin{equation}
    \label{loss}
    \mathcal{L}_{depth}^C=\Big\| \hat{\textbf{D}}(\textbf{p}) -\textbf{D}(\textbf{p}) \Big\|_1 .
\end{equation}

\paragraph{Depth smoothness loss $\mathcal{L}_{smooth}^C$:} Additionally, we incorporate a depth smoothness loss, similar to~\cite{smooth-loss}, enforcing neighboring pixels to have close depth values by using second-order gradients of the estimated depth:
\begin{equation}
    \mathcal{L}_{smooth}^C= e^{-\nabla^2\textbf{D}(\textbf{p})} \Bigl(\bigl| \nabla_{xx} \hat{\textbf{D}}(\textbf{p} )\bigr| + \bigl|
        \nabla_{xy} \hat{\textbf{D}}(\textbf{p} ) \bigr|+ \bigl|\nabla_{yy}\hat{\textbf{D}}(\textbf{p} )\bigr| \Bigr),
\end{equation}
where $\nabla^2\textbf{D}(\textbf{p})$ is the Laplacian of the input depth, whose exponential is used as a weighting term to assign less importance to pixels that are more likely to be edges and discontinuities.
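
The discretization of these operators is not specified above. One standard choice, assumed here purely for illustration, uses central finite differences on the pixel grid, with $\textbf{p}=(x,y)$ and unit pixel spacing:
\begin{align*}
\nabla_{xx}\hat{\textbf{D}}(x,y) &\approx \hat{\textbf{D}}(x+1,y)-2\hat{\textbf{D}}(x,y)+\hat{\textbf{D}}(x-1,y),\\
\nabla_{yy}\hat{\textbf{D}}(x,y) &\approx \hat{\textbf{D}}(x,y+1)-2\hat{\textbf{D}}(x,y)+\hat{\textbf{D}}(x,y-1),\\
\nabla_{xy}\hat{\textbf{D}}(x,y) &\approx \tfrac{1}{4}\bigl[\hat{\textbf{D}}(x+1,y+1)-\hat{\textbf{D}}(x+1,y-1)-\hat{\textbf{D}}(x-1,y+1)+\hat{\textbf{D}}(x-1,y-1)\bigr],
\end{align*}
with the Laplacian of the input depth given by $\nabla^2\textbf{D}=\nabla_{xx}\textbf{D}+\nabla_{yy}\textbf{D}$ under the same scheme.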


\paragraph{Normal loss $\mathcal{L}_\textbf{N}^C$ and $\mathcal{L}_\textbf{N}^D$:} This term is applied to both the canonical and deformed spaces to encourage the rendered triangle normals $\textbf{n}(\textbf{p})$ to align with the pseudo-ground truth surface normals $\textbf{N}(\textbf{p})$ using:
\begin{equation}\label{eq:normal_loss}
    \mathcal{L}_\textbf{N}=1-\textbf{n}(\textbf{p})^\top\textbf{N}(\textbf{p}).
\end{equation}
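
Assuming both $\textbf{n}(\textbf{p})$ and $\textbf{N}(\textbf{p})$ are unit vectors, the inner product in \equationref{eq:normal_loss} is the cosine of the angle $\alpha(\textbf{p})$ between them (a symbol introduced here only for illustration), so the per-pixel value is
\begin{equation*}
1-\textbf{n}(\textbf{p})^\top\textbf{N}(\textbf{p}) = 1-\cos\alpha(\textbf{p}) \in [0,2],
\end{equation*}
which is $0$ for perfectly aligned normals, $1$ for orthogonal normals and $2$ for opposite normals.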


\paragraph{Rotation and translation losses:} These two terms are used to encourage the deformation network to estimate rotations and translations close to zero, to avoid large deformations that would lead to a degenerate solution. They are defined as an $l_2$-norm loss:
\begin{align}
    \mathcal{L}_{rot}^D &= \|\textbf{e}\|_2, \\
    \mathcal{L}_{\delta\textbf{v}}^D &= \|\delta\textbf{v}\|_2.
\end{align}

\paragraph{Position consistency loss $\mathcal{L}_{pos}^D$:} Following~\cite{surgicalgaussian}, we introduce a consistency prior on the deformation of neighboring triangles. The underlying intuition is that spatially close triangles in the canonical space should remain similarly close after deformation. This loss encourages the relative distances between the $K$ nearest neighbors to be preserved after deformation, thus promoting coherent local movements and reducing the risk of degenerate distortions. The position consistency loss is formally defined as:
\begin{equation}
\mathcal{L}_{pos}^D=\sum_{n=1}^{N_T}\sum_{k=1}^K \biggl\|d\Bigl( \textbf{x}_c^{(n)},\textbf{x}_c^{(k)}\Bigr) - d\Bigl( \textbf{x}_o^{(n)},\textbf{x}_o^{(k)}\Bigr) \biggr\|_1,
\end{equation}
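where $\textbf{x}_c^{(n)}$ and $\textbf{x}_o^{(n)}$ denote the center of the \textit{n}-th triangle in canonical and deformed space, respectively, $\textbf{x}_c^{(k)}$ and $\textbf{x}_o^{(k)}$ denote the center of its \textit{k}-th nearest neighbor in the same two spaces, and $d(\cdot,\cdot)$ is the Euclidean distance.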

\section{Experimental results}
\subsection{Implementation details}
We initialize the scene following Surgical Gaussian~\cite{surgicalgaussian}, without applying masks to remove the surgical tools. Canonical triangles are initialized with the same parameters as triangle splatting~\cite{trianglesplatting26}, and for the first 100 iterations only the canonical representation is optimized. Then the deformation network is initialized and optimized up to 40k iterations. Densification and pruning are done every 500 iterations and stop after 5k iterations, and the canonical scene is frozen after 6k iterations, as further optimization provides no benefit.

We present our experimental results on the EndoNeRF~\cite{endoNERF} dataset, which consists of two robotic prostatectomy stereo videos recorded with a static camera, with different deformations, such as {\em pulling} and {\em cutting} of soft tissues. Since our method is designed for monocular inputs, we ignore all stereo information and only use the left frames for both input and depth estimation. Additionally, each video is processed in shorter sequences to avoid the instability introduced by long-range temporal dependencies. This reduces optimization complexity, confines the deformation network to local motion, and yields more stable and efficient convergence.
For quantitative evaluation, we report Peak Signal-to-Noise Ratio (PSNR)~\cite{psnr}, Structural Similarity Index Measure (SSIM)~\cite{ssim} and Learned Perceptual Image Patch Similarity (LPIPS)~\cite{lpips}.
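
For reference, PSNR is derived from the mean squared error between a rendered image $\hat{\textbf{I}}$ and the corresponding input frame $\textbf{I}$ (the symbol $\hat{\textbf{I}}$ is introduced here only for illustration). Assuming intensities normalized to $[0,1]$, which is not stated explicitly in this section:
\begin{equation*}
\text{MSE}=\frac{1}{3HW}\sum_{\textbf{p}}\bigl\|\hat{\textbf{I}}(\textbf{p})-\textbf{I}(\textbf{p})\bigr\|_2^2,
\qquad
\text{PSNR}=10\log_{10}\frac{1}{\text{MSE}}.
\end{equation*}
Under this normalization, a PSNR of 38 dB corresponds to an MSE of $10^{-3.8}\approx 1.6\times10^{-4}$.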

For further experimental validation, we considered the dataset introduced in NeRFscopy~\cite{salortISBI26}, which is composed of four in-vivo monocular surgical videos: two Totally Endoscopic Coronary Artery Bypass (TECAB) procedures~\cite{AgudoICIP2021,heart}, a lung lobectomy~\cite{hamlynsurgeryvid}, and a bronchoscopy~\cite{UrdapilletaICIP2023}. All these sequences exhibit mild to severe tissue deformations, different anatomies, and varying illumination conditions, which makes them more challenging. 

Additionally, we also used a clip from the StereoMIS~\cite{stereomis} dataset, captured during Da Vinci robotic surgery in in-vivo porcine subjects. For qualitative and quantitative evaluation of this scene see Appendix~\ref{app:extra}. 

\subsection{Loss terms ablation}
We conducted an ablation analysis on the main loss components of our model using the {\em pulling} sequence. Starting from the full Endo-4DTS loss in~\equationref{eq:final-loss}, we removed the terms progressively, in the following order: the normal deformation loss $\mathcal{L}_\textbf{N}^D$, the normal canonical loss $\mathcal{L}_\textbf{N}^C$, the positional consistency loss $\mathcal{L}_{pos}^D$, the depth smoothness loss $\mathcal{L}_{smooth}^C$ and the depth loss $\mathcal{L}_{depth}^C$. Quantitative results in \tableref{tab:ablation} show a seemingly counter-intuitive behavior: omitting all of these losses yields the best photometric results. These metrics, however, do not capture geometric consistency. Qualitative inspection (see \figureref{fig:ablation}) reveals that removing all of them produces visually sharp renderings but significantly noisier and less consistent depth and normal estimates, indicating that the deformation network becomes under-constrained and overfits to the photometric cues. Adding the depth loss leads to the largest improvement, removing depth discontinuities and stabilizing normals, though specular-related artifacts persist. Incorporating the normal losses mitigates these effects and improves rendering and normal quality, although minor depth floaters remain.
\begin{table}[htbp] 
\floatconts
{tab:ablation}
{\caption{\textbf{Quantitative results of the ablation analysis} on the {\em pulling} sequence. Best results in \textbf{bold}, second best \underline{underlined}.}}
{\centering
\begin{tabular}{cccc}
\hline
& PSNR$\uparrow$ & SSIM $\uparrow$& LPIPS $\downarrow$ \\
\hline
Endo-4DTS & \underline{38.113} & \underline{0.958} & \underline{0.044} \\
\hline
w/o $\mathcal{L}_\textbf{N}^D$ & 37.667 & 0.956 & 0.046  \\
w/o $\mathcal{L}_\textbf{N}^C$ & 37.116 & 0.953 &  0.046   \\
w/o $\mathcal{L}_{pos}^D$  & 37.661 & 0.955 & 0.045\\
w/o $\mathcal{L}_{smooth}^C$ & 37.411 & 0.953 & 0.047  \\
w/o $\mathcal{L}_{depth}^C$ & \textbf{38.304}  & \textbf{0.959} & \textbf{0.043} \\
\hline
\end{tabular}}
\end{table}

\begin{figure}[htbp]
\floatconts
    {fig:ablation}
    {\caption{\textbf{Qualitative results of the ablation analysis} on a frame from the {\em pulling} sequence. 
    \textbf{Row 1:} RGB images. \textbf{Row 2:} Zoomed in RGB images. \textbf{Row 3:} Surface depth. \textbf{Row 4:} Surface normal.}}
    {
    \centering
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}[C]{c|cccccc}
        Reference & Endo-4DTS & w/o $\mathcal{L}_\textbf{N}^D$ & w/o $\mathcal{L}_\textbf{N}^C$& w/o $\mathcal{L}_{pos}^D$& w/o  $\mathcal{L}_{smooth}^C$ & w/o  $\mathcal{L}_{depth}^C$\\
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[width=\textwidth]{images/ablation/frame-000054.color.png}};
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);
          \end{tikzpicture}
       \end{minipage} &
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[width=\textwidth]{images/sensitivity/0_5/renders/00012.png}};
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);          \end{tikzpicture}
          \end{minipage} & 
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
            \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[ width=\textwidth]{images/ablation/noLnD/00012.png}};
            % Draw rectangle (adjust coords!)
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);          \end{tikzpicture}
          \end{minipage} &
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
            \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC/00012.png}};
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);          \end{tikzpicture}
          \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}

            \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos/00012.png}};
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);          \end{tikzpicture}
          \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
            \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos-nosmooth/00012.png}};
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);          \end{tikzpicture}
            \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
            \begin{tikzpicture}
            \node[anchor=south west,inner sep=0] (img) 
              {\includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos-nosmooth-nodepth/00012.png}};
            \draw[red,thick] (0.2,0.25) rectangle (0.95,0.8);          \end{tikzpicture}
        \end{minipage} \\
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/ablation/frame-000054.color.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/sensitivity/0_5/renders/00012.png}
        \end{minipage}& 
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/ablation/noLnD/00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/ablation/noLnD-noLnC/00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/ablation/noLnD-noLnC-nopos/00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/ablation/noLnD-noLnC-nopos-nosmooth/00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth,trim={70 70 320 250},clip]{images/ablation/noLnD-noLnC-nopos-nosmooth-nodepth/00012.png}
        \end{minipage} \\
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
        \includegraphics[width=\textwidth]{images/ablation/frame_0054.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_5/depth/00012_aligned.png}
        \end{minipage}& 
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD/depth-00012_aligned.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC/depth-00012_aligned.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos/depth-00012_aligned.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos-nosmooth/depth-00012_aligned.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos-nosmooth-nodepth/depth-00012_aligned.png}
        \end{minipage}\\

        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
        \includegraphics[width=\textwidth]{images/pulling/rend_normalCANON.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_5/surf_normal_00000.png}
        \end{minipage}& 
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/ablation/noLnD/surf_normal_00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC/surf_normal_00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos/surf_normal_00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering%
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos-nosmooth/surf_normal_00012.png}
        \end{minipage}&
        \begin{minipage}{0.12\textwidth}\centering
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/ablation/noLnD-noLnC-nopos-nosmooth-nodepth/surf_normal_00012.png}
        \end{minipage}
    \end{tabular}}
\end{figure}


\subsection{Sensitivity study}
We conducted a sensitivity analysis of the loss weights $\lambda_{11}$ and $\lambda_{12}$ associated with the rotation and translation regularizations, motivated by the observation that removing these terms left the deformation network under-constrained. As shown in \tableref{tab:sensitivity}, reducing the weights consistently degrades performance, indicating that stronger regularization improves optimization stability. Qualitatively, \figureref{fig:sensitivity} confirms this trend: larger weights yield better rendering quality and more visually accurate geometry.

\begin{table}[htbp] 
\floatconts
    {tab:sensitivity}
    {\caption{\textbf{Quantitative results of the sensitivity study} on $\lambda_{11}$ and $\lambda_{12}$ on the {\em pulling} sequence. Best results in \textbf{bold}.}}
    {\centering
    \begin{tabular}{cccc}
    \hline
    $\lambda$ values& PSNR$\uparrow$ & SSIM $\uparrow$& LPIPS $\downarrow$ \\
    
    \hline
    0.5 & \textbf{37.667} & \textbf{0.956} & \textbf{0.046} \\
    0.1  & 36.499 & 0.951 & 0.054  \\
    0.01  & 35.654 & 0.946 & 0.0632   \\
    \hline
    \end{tabular}
    }
\end{table}

\begin{figure}[h!]
\floatconts
    {fig:sensitivity}
    {\caption{\textbf{Qualitative results of the sensitivity study} on $\lambda_{11}$ and $\lambda_{12}$ on the {\em pulling} sequence.
    \textbf{Row 1:} RGB images. \textbf{Row 2:} Surface depth. \textbf{Row 3:} Surface normal.}}
    {\centering
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}[C]{c|ccc}
        Reference & $\lambda_{11,12} = 0.01$& $\lambda_{11,12} = 0.1$ & $\lambda_{11,12} = 0.5$\\
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/gt-images/frame-000040.color.png}
        \end{minipage}&
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_01/renders/00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/sensitivity/0_1/renders/00000.png}
        \end{minipage}&
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_5/renders/00000.png}
        \end{minipage} \\

        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
        
          \includegraphics[width=\textwidth]{images/sensitivity/gt-depth/frame_0040.png}
        \end{minipage}&
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_01/depth/00000_aligned.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/sensitivity/0_1/depth/00000_aligned.png}
        \end{minipage}&
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_5/depth/00000_aligned.png}
        \end{minipage}        \\
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
        \includegraphics[width=\textwidth]{images/pulling/rend_normalCANON.png}
        \end{minipage}&
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_01/surf_normal_00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/sensitivity/0_1/surf_normal_00000.png}
        \end{minipage}&
        \begin{minipage}[c]{0.12\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/sensitivity/0_5/surf_normal_00000.png}
        \end{minipage}
    \end{tabular} }
\end{figure}

\subsection{Final results}
\paragraph{EndoNeRF dataset.}
We provide a quantitative and qualitative assessment of the final version of Endo-4DTS. In \tableref{tab:finalcomparison}, we compare our results with NeRF-based and Gaussian-based methods on the EndoNeRF~\cite{endoNERF} dataset. A direct comparison with other triangle-splatting methods is not possible because, to our knowledge, we are the first to extend triangle splatting~\cite{trianglesplatting26} to dynamic scenes.


\begin{table}[htbp]
\floatconts
    {tab:finalcomparison}
    {\caption{\textbf{Quantitative comparison} of our method Endo-4DTS with EndoNeRF~\cite{endoNERF}, EndoSurf~\cite{endoSURF}, LerPlane~\cite{lerplane}, Endo-4DGS~\cite{endo4dgs}, EndoGaussian~\cite{endogaussian}, and SurgicalGaussian~\cite{surgicalgaussian} on the EndoNeRF~\cite{endoNERF} dataset. Best results are highlighted in \textbf{bold}.}}
    {\centering
    
    \begin{tabular}{c|ccc|ccc}
    \hline
    \multirow{2}{*}{Methods} & \multicolumn{3}{c|}{``pulling''} & \multicolumn{3}{c}{``cutting''} \\
    \cline{2-7}
    & PSNR$\uparrow$ & SSIM $\uparrow$& LPIPS $\downarrow$ & PSNR $\uparrow$ & SSIM $\uparrow$& LPIPS $\downarrow$\\
    \hline
    EndoNeRF      & 34.217 & 0.938 & 0.160 & 34.186 & 0.932 & 0.151 \\
    EndoSurf      & 35.004 & 0.956 & 0.120 & 34.981 & 0.953 & 0.106 \\
    LerPlane     & 36.241 & 0.950 & 0.102 & 35.580 & 0.955 & 0.101 \\
    Endo-4DGS     & 36.56 & 0.955 & 0.032 & 37.85 & 0.959 & 0.043 \\
    EndoGaussian  & 37.308 & 0.958 & 0.070 & 38.287 & 0.962 & 0.058 \\
    SurgicalGaussian & 38.783 & 0.970 & 0.049 & 37.505 & 0.961 & 0.062 \\
    Endo-4DTS (Ours) & \textbf{40.39} & \textbf{0.971} & \textbf{0.026} & \textbf{38.876} & \textbf{0.966} & \textbf{0.029} \\
    \hline
    \end{tabular}
    }
\end{table}


As can be seen, our method outperforms all other approaches on both sequences and across all three metrics, demonstrating its representation capacity and rendering quality. The improvement is particularly pronounced in LPIPS, which highlights the ability of Endo-4DTS to generate more visually realistic images. The qualitative results in \figureref{final-pulling} and \figureref{final-cutting} further confirm these findings, capturing details such as tissue capillaries and specular highlights with high fidelity, without compromising the geometric properties of the scene.
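
As a rough guide to the size of these PSNR gaps, recall that PSNR is a logarithmic function of the mean squared error (MSE). Assuming the standard definition $\mathrm{PSNR} = 10\log_{10}(I_{\max}^2/\mathrm{MSE})$, with $I_{\max}$ the maximum pixel intensity, a gain of $\Delta$~dB over a reference method corresponds to an MSE ratio of
\[
\frac{\mathrm{MSE}_{\text{ours}}}{\mathrm{MSE}_{\text{ref}}} = 10^{-\Delta/10}.
\]
For the {\em pulling} sequence, $\Delta = 40.39 - 38.783 \approx 1.61$~dB with respect to SurgicalGaussian, which gives a ratio of approximately $0.69$, i.e., roughly $31\%$ lower error. For the {\em cutting} sequence, $\Delta = 38.876 - 38.287 \approx 0.59$~dB with respect to EndoGaussian gives a ratio of approximately $0.87$. These figures are only indicative: the relation is exact for a single MSE value, whereas the reported PSNR values are presumably averaged over frames.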

\begin{figure}[h!]
\floatconts
    {final-pulling}
    {\caption{\textbf{Rendered results of Endo-4DTS model on five frames of the {\em pulling} video.} \textbf{Row 1:} RGB input. \textbf{Row 2:} RGB output. \textbf{Row 3:} Depth input. \textbf{Row 4:} Depth output.}}
    {\centering
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}[C]{ccccc}
        
        \begin{minipage}[c]{0.14\textwidth}
        
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-image/frame-000000.color.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-image/frame-000009.color.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/pulling/gt-image/frame-000027.color.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-image/frame-000036.color.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-image/frame-000054.color.png}
        \end{minipage} \\
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/render/frame-00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/pulling/render/frame-00009.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/render/frame-00027.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/render/frame-00036.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/render/frame-00054.png}
        \end{minipage}\\
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-depth/frame_0000.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-depth/frame_0009.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/pulling/gt-depth/frame_0027.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-depth/frame_0036.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/gt-depth/frame_0054.png}
        \end{minipage} \\

        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/depth/png00000_aligned.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/pulling/depth/png00009_aligned.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/depth/png00027_aligned.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/depth/png00036_aligned.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/pulling/depth/png00054_aligned.png}
        \end{minipage}
    \end{tabular}}
\end{figure}

\begin{figure}[H]
\floatconts
    {final-cutting}
    {\caption{\textbf{Rendered results of Endo-4DTS model on five frames of the {\em cutting} video.} \textbf{Row 1:} RGB input. \textbf{Row 2:} RGB output. \textbf{Row 3:} Depth input. \textbf{Row 4:} Depth output.}}
    {\centering
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}[C]{ccccc}
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-image/000013.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-image/000039.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/cutting/gt-image/000065.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-image/000091.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-image/000117.png}
        \end{minipage} \\
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/render/0000013.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/cutting/render/39.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/render/65.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/render/91.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/render/117.png}
        \end{minipage}\\
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-depth/frame_0013.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-depth/frame_0039.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/cutting/gt-depth/frame_0065.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-depth/frame_0091.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/gt-depth/frame_0117.png}
        \end{minipage} \\

        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/depth/png00013_aligned.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/cutting/depth/png00039_aligned.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/depth/png00065_aligned.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/depth/png00091_aligned.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/cutting/depth/png00117_aligned.png}
        \end{minipage}
    \end{tabular}}
\end{figure}

However, while the estimated photometric properties of the scene are remarkable, the surface depth estimation presents small localized inconsistencies, with a small number of triangles that are not perfectly aligned with the surrounding surface. These triangles are typically displaced only slightly in front of or behind the main surface, rather than appearing at extreme depths. Such errors in the geometry may arise because the MLP tends to prioritize photometric accuracy over the depth regularization and other geometry-related objectives during optimization. Additionally, we observe a larger number of misplaced triangles around the moving surgical tools, which occlude the tissues behind them and make it more difficult for the MLP to learn where to place those triangles when a new part of the tissue, at a considerably different depth than the tool, is revealed.
Moreover, the densification strategy could introduce some undesired noise to the triangles, given that it was taken directly from triangle splatting~\cite{trianglesplatting26}, which only considers rigid scenes. The design of densification and pruning strategies specifically tailored for deformable scenes is a promising line for future research, which we plan to explore. In the future, we will also consider more sophisticated geometric priors that constrain the motion of triangles, in order to preserve both geometric and photometric properties simultaneously.

\paragraph{NeRFscopy dataset.}
To evaluate the generalization capabilities of our method, we additionally tested it on the NeRFscopy dataset~\cite{salortISBI26}. \tableref{tab:extraexperiments} presents a quantitative comparison between Endo-4DTS, NeRFscopy~\cite{salortISBI26} and EndoNeRF~\cite{endoNERF}. Again, Endo-4DTS outperforms both previous methods across all metrics and on all four sequences. Qualitative results are shown in \figureref{fig:NeRFscopy}, where our method produces accurate RGB renderings and plausible geometry across different scenarios.

Our method typically requires 3 to 5 hours to train on a single NVIDIA RTX A5000 GPU. Training and rendering times strongly depend on both the number of triangles used to represent the scene and the resolution of the rendered images; in our experiments, we observe an average rendering speed of 17.74 FPS on the EndoNeRF~\cite{endoNERF} dataset, 26.86 FPS on the NeRFscopy~\cite{salortISBI26} dataset and 16.84 FPS on the StereoMIS~\cite{stereomis} dataset. We emphasize that the primary focus of this work was to demonstrate the feasibility and effectiveness of extending triangle splatting~\cite{trianglesplatting26} to deformable endoscopy environments, rather than to optimize computational efficiency, which will be addressed in future work. In any case, Endo-4DTS obtains a remarkable trade-off between accuracy and computational efficiency.
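
For reference, the reported rendering speeds can be converted into an average time per frame through
\[
t_{\text{frame}} = \frac{1000}{\mathrm{FPS}}\ \text{ms},
\]
which gives approximately $56.4$~ms, $37.2$~ms and $59.4$~ms per frame for the EndoNeRF, NeRFscopy and StereoMIS datasets, respectively. This is a direct conversion of the averages above and does not account for variation between individual frames.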


\section{Conclusions}
We introduced Endo-4DTS, a self-supervised method for the synthesis of deformable endoscopy scenes that extends triangle splatting~\cite{trianglesplatting26} to non-rigid environments through an explicit canonical representation of the scene, jointly optimized with a deformation network that models the triangle transformations. Our loss design stabilizes training and enables learning an explicit, deformable representation of the scene. Experimental results demonstrate state-of-the-art performance across PSNR, SSIM, and LPIPS, with consistently sharper and more coherent renderings. To the best of our knowledge, we are the first to successfully extend triangle splatting~\cite{trianglesplatting26} to deformable environments, and the first to apply it to endoscopy scenes.

\begin{table}[h!]
\floatconts
    {tab:extraexperiments}
    {\caption{\textbf{Additional experiments} of our method Endo-4DTS, compared with EndoNeRF~\cite{endoNERF} and NeRFscopy~\cite{salortISBI26} on the NeRFscopy~\cite{salortISBI26} dataset. Best results are highlighted in \textbf{bold}.}}
    {\centering
    
    \begin{tabular}{l|l|c|c|c}
    \multicolumn{2}{c|}{}& {PSNR} $\uparrow$ &  {SSIM} $\uparrow$& {LPIPS}$\downarrow$\\\hline
    \multirow{3}{*}{TECAB1} & EndoNeRF& 25.791&  0.742& 0.255\\\cline{2-5}
    &NeRFscopy & 25.811& 0.750& 0.255\\ \cline{2-5} 
    &Endo-4DTS&  \textbf{28.26}&  \textbf{0.863}& \textbf{0.136}\\
    \hline
    \hline
    \multirow{3}{*}{TECAB2} &EndoNeRF& 24.954&  0.685& 0.419\\\cline{2-5}
    &NeRFscopy &  24.864&0.689& 0.429\\\cline{2-5}
    &Endo-4DTS & \textbf{25.983}& \textbf{0.783} &\textbf{0.238}\\ 
    \hline
    \hline
    \multirow{3}{*}{Lung Lobectomy} &EndoNeRF & 27.142&  0.788& 0.293\\\cline{2-5}
    &NeRFscopy&  27.285&  0.791& 0.275\\\cline{2-5}
    &Endo-4DTS&  \textbf{28.435}& \textbf{0.825}& \textbf{0.178}\\
    \hline
    \hline
    \multirow{3}{*}{Bronchoscopy} &EndoNeRF& 33.872&  0.867& 0.588\\\cline{2-5}
    
    &NeRFscopy&  34.405& 0.875& 0.156\\\cline{2-5}
    &Endo-4DTS& \textbf{35.566} &	\textbf{0.902}	&\textbf{0.080} \\
    \bottomrule
    \end{tabular}
    }
\end{table}

\begin{figure}[h!]
\floatconts
    {fig:NeRFscopy}
    {\caption{\textbf{Qualitative evaluation on the NeRFscopy dataset.} Both halves follow the same layout. Columns show, from left to right: an arbitrary input frame, the rendered RGB output, the input depth estimation, and the rendered depth output. \textbf{Left:} TECAB1 and TECAB2 images. \textbf{Right:} Bronchoscopy and lung lobectomy images.}}
    {\centering
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}[C]{cccc?cccc}
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/heart/input_051.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/heart/00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/nerfscopy/heart/51_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/heart/png00000_aligned.png}
        \end{minipage}&
        
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/broncho/0021.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/broncho/00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/nerfscopy/broncho/21_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/broncho/png00000_aligned.png}
        \end{minipage}  \\

        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/vid21/350.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/vid21/00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/nerfscopy/vid21/350_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/vid21/png00000_aligned.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
        \includegraphics[width=\textwidth]{images/nerfscopy/surgery3/input_002.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/surgery3/00000.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/nerfscopy/surgery3/000002_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.11\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/nerfscopy/surgery3/png00000_aligned.png}
        \end{minipage} \\

    \end{tabular} }
\end{figure}







\clearpage  
\midlacknowledgments{This work has been supported by the project GRAVATAR PID2023-151184OB-I00 funded by MCIU/AEI/10.13039/501100011033 and by ERDF, UE; and by the Government of Catalonia under 2025 FI-STEP 00398.}


\bibliography{midl26_404}
\appendix\section{Supplementary experiments}\label{app:extra}

To further evaluate the performance of Endo-4DTS, we also used one sequence from the StereoMIS~\cite{stereomis} dataset, specifically the first 110 frames of sequence ``P2\_7'', as done in SurgicalGS~\cite{surgicalgs}. \tableref{tab:supplementary} reports a quantitative comparison with several NeRF- and Gaussian-based approaches. As shown, Endo-4DTS outperforms all previous state-of-the-art methods across all evaluated metrics.

These quantitative results are supported by the qualitative results presented in \figureref{fig:extra}, which shows several frames of the sequence. Our method produces renderings with high photometric fidelity, accurately representing fine details and texture. Moreover, the estimated depth maps are temporally and geometrically coherent on the tissue surfaces. However, the depth around the surgical instruments exhibits more artifacts and lower geometric quality.

\begin{table}[htbp]
\floatconts
    {tab:supplementary}
    {\caption{\textbf{Quantitative comparison} of our method Endo-4DTS with EndoNeRF~\cite{endoNERF}, EndoSurf~\cite{endoSURF}, LerPlane~\cite{lerplane}, EndoGaussian~\cite{endogaussian}, Deform3DGS~\cite{deform3dgs} and SurgicalGS~\cite{surgicalgs} on the StereoMIS~\cite{stereomis} dataset. Best results are highlighted in \textbf{bold}.}}
    {\centering
    
    \begin{tabular}{c|ccc}
    \hline
    {Methods} & PSNR$\uparrow$ & SSIM $\uparrow$& LPIPS $\downarrow$ \\
    \hline
    EndoNeRF      & 28.79 &  0.809 & 0.266 \\
    EndoSurf      & 29.36 & 0.861 & 0.211 \\
    LerPlane     & 29.09 & 0.789 & 0.179 \\
    EndoGaussian& 31.02 & 0.878 & 0.132\\
    Deform3DGS  & 31.61 & 0.888 & 0.135 \\
    SurgicalGS & 31.54 & 0.885 & 0.148 \\
    Endo-4DTS (Ours) &\textbf{32.110}	&\textbf{0.903}	&\textbf{0.076}\\
    \hline
    \end{tabular}
    }
\end{table}

\begin{figure}[h!]
\floatconts
    {fig:extra}
    {\caption{\textbf{Rendered results of the Endo-4DTS model on four frames of the StereoMIS~\cite{stereomis} video.} \textbf{Row 1:} RGB input. \textbf{Row 2:} RGB output. \textbf{Row 3:} Depth input. \textbf{Row 4:} Depth output.}}
    {\centering
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}[C]{cccc}
        
        \begin{minipage}[c]{0.14\textwidth}
        
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/frame-000066.color.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/frame-000077.color.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/stereomis/frame-000088.color.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/frame-000099.color.png}
        \end{minipage}
        \\
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/00066.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/stereomis/00077.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/00088.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/00099.png}
        \end{minipage}\\
        
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/66_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/77_depth.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/stereomis/88_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/99_depth.png}
        \end{minipage} \\

        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/00066_depth.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[ width=\textwidth]{images/stereomis/00077_depth.png}
        \end{minipage}&
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/00088_depth.png}
        \end{minipage}& 
        \begin{minipage}[c]{0.14\textwidth}
        \vspace{0.1cm}
          \includegraphics[width=\textwidth]{images/stereomis/00099_depth.png}
        \end{minipage}
    \end{tabular}}
\end{figure}


\end{document}
