Abstract: Video depth estimation has been applied to various endoscopy tasks, such as reconstruction, navigation, and surgery. Recently, many methods have focused on directly applying or adapting depth estimation foundation models to endoscopy scenes. However, these methods do not consider temporal information, leading to inconsistent predictions. We propose Endoscopic Depth Any Video (EndoDAV) to estimate spatially accurate and temporally consistent endoscopic video depth, which significantly expands the usability of depth estimation in downstream tasks. Specifically, we parameter-efficiently finetune a video depth estimation foundation model for endoscopy scenes, using a self-supervised depth estimation framework that simultaneously learns depth and camera pose. Considering the distinct characteristics of endoscopic videos compared to common videos, we further design a novel loss function and a depth alignment inference strategy to enhance temporal consistency. Experiments on two public endoscopy datasets demonstrate that our method achieves superior performance in both spatial accuracy and temporal consistency. Code is available at https://github.com/Zanue/EndoDAV.
DOI: 10.1007/978-3-032-05114-1_19